Safety
Adaptive and Explicit Safe: Triggering Latent Safety Awareness in Large Reasoning Models
The paper introduces Safe Trigger, an approach that improves the safety of Large Reasoning Models (LRMs) by drawing on their inherent Latent Safety Awareness. It uses Supervised Fine-Tuning (SFT) to induce safety tags and Direct Preference Optimization (DPO) for stability. The reported result is a reduction in Attack Success Rate (ASR) of 24.65% on harmful benchmarks and 36.72% on jailbreak benchmarks, with general performance maintained. For practitioners deploying LRMs, the method offers adaptive safety measures without compromising user experience.
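To make the ASR figures concrete, here is a minimal sketch of how an Attack Success Rate and its reduction are commonly computed. It is not the paper's evaluation code: the function names and the toy judgment lists are hypothetical. The summary does not say whether the reported reductions are absolute percentage points or relative decreases, so the sketch shows both.

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Return ASR as a percentage.

    Each entry is True when the response to an attack prompt was judged
    harmful, meaning the attack succeeded.
    """
    if not judgments:
        raise ValueError("judgments must be non-empty")
    return 100.0 * sum(judgments) / len(judgments)


def absolute_reduction(before: float, after: float) -> float:
    """Reduction in percentage points."""
    return before - after


def relative_reduction(before: float, after: float) -> float:
    """Reduction as a percentage of the baseline ASR."""
    if before == 0:
        raise ValueError("baseline ASR must be non-zero")
    return 100.0 * (before - after) / before


# Hypothetical toy judgments, not data from the paper.
baseline_judgments = [True, True, True, False]
safe_judgments = [True, False, False, False]

baseline_asr = attack_success_rate(baseline_judgments)
safe_asr = attack_success_rate(safe_judgments)

print(f"baseline ASR: {baseline_asr:.2f}%")
print(f"post-training ASR: {safe_asr:.2f}%")
print(f"absolute reduction: {absolute_reduction(baseline_asr, safe_asr):.2f} points")
print(f"relative reduction: {relative_reduction(baseline_asr, safe_asr):.2f}%")
```

The two readings can differ a lot for the same pair of ASR values, so it is worth checking which one the paper uses when comparing the reported figures against other methods.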
safety, LLM, reasoning