Safety · arXiv cs.CL · 2 d ago

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

The paper investigates what happens when instruction-tuned large language models (LLMs) are converted into reasoning models through post-training, and finds that this process often compromises alignment behaviors such as safety and bias avoidance. A systematic audit compares reasoning models developed via supervised fine-tuning, reinforcement learning, and distillation against instruction-tuned baselines. It shows significant regressions in trustworthiness metrics, including increased toxicity and privacy leakage, despite improvements on reasoning benchmarks. Practitioners should therefore evaluate trustworthiness alongside reasoning capability when developing and deploying reasoning models.

alignment · trustworthiness · llm
