Safety · arXiv cs.AI — 8 d ago

Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models

The paper introduces Safe Trigger, an approach that improves safety in Large Reasoning Models (LRMs) by leveraging their inherent Latent Safety Awareness. It uses Supervised Fine-Tuning (SFT) to induce safety tags and Direct Preference Optimization (DPO) for stability, reducing the Attack Success Rate (ASR) on harmful and jailbreak benchmarks by 24.65% and 36.72%, respectively, while maintaining general performance. For practitioners deploying safer AI systems, the method offers adaptive safety measures without compromising user experience.
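To make the two quantities in this summary concrete, here is a minimal Python sketch of how ASR is commonly computed and of the standard per-pair DPO loss. It is not the paper's code: the function names, the toy judgment lists, and the beta value are assumptions made for illustration, and the summary does not say whether the reported reductions are absolute (percentage points) or relative, so the sketch prints both.

```python
import math


def attack_success_rate(judgments):
    """Fraction of attack prompts whose response was judged harmful.

    `judgments` is a list of booleans (True = the attack succeeded).
    How a response is judged harmful is outside this sketch.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(1 for j in judgments if j) / len(judgments)


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. beta=0.1 is an assumed value.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Toy judgments, invented for illustration only (not results from the paper).
    before = [True, True, False, True]
    after = [False, True, False, False]

    asr_before = attack_success_rate(before)
    asr_after = attack_success_rate(after)
    print("absolute reduction:", asr_before - asr_after)
    print("relative reduction:", (asr_before - asr_after) / asr_before)

    # The loss drops once the policy prefers the chosen (safe) response
    # more strongly than the reference model does.
    print("loss, no preference:", dpo_loss(0.0, 0.0, 0.0, 0.0))
    print("loss, safe preferred:", dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

In a safety setting like the one described, the "chosen" response would presumably be the safe completion and the "rejected" one the harmful completion, though the summary does not specify how the preference pairs are built.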

safety · LLM · reasoning
