Training
On-Policy Distillation with Curriculum Turn-level Guidance for Multi-turn Agents
The paper introduces Guided On-Policy Distillation (Guided-OPD), an algorithm that improves smaller multi-turn agents by mitigating the compounding errors that arise during on-policy distillation. It mixes teacher- and student-generated turns within a trajectory and gradually reduces teacher intervention, which keeps the student's trajectories aligned with the teacher's state distribution. Evaluated on ALFWorld, ScienceWorld, and WebShop, Guided-OPD achieves a 21.1% increase in Score and a 25.5% increase in Success Rate when distilling Qwen3 students from a Qwen3-30B-A3B teacher. These results suggest it can reduce inference costs while maintaining performance in practical applications.
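The turn-level mixing described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the linear decay schedule, the independent per-turn coin flip, and the names `teacher_prob` and `mixed_rollout` are all assumptions made for illustration, and the distillation loss itself is omitted.

```python
import random
from typing import Callable, List, Tuple

# A policy maps the turn history so far to the next action (hypothetical interface).
Policy = Callable[[List[str]], str]


def teacher_prob(step: int, total_steps: int,
                 p_start: float = 1.0, p_end: float = 0.0) -> float:
    """Probability that the teacher generates a given turn.

    Assumed schedule: linear decay from p_start to p_end over training.
    """
    if total_steps <= 1:
        return p_end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


def mixed_rollout(student: Policy, teacher: Policy, num_turns: int,
                  p_teacher: float, rng: random.Random) -> List[Tuple[str, str]]:
    """Roll out one trajectory, choosing the acting policy independently per turn."""
    history: List[str] = []
    trajectory: List[Tuple[str, str]] = []
    for _ in range(num_turns):
        # Assumption: an independent Bernoulli draw decides who acts at each turn.
        actor = "teacher" if rng.random() < p_teacher else "student"
        policy = teacher if actor == "teacher" else student
        action = policy(history)
        history.append(action)
        trajectory.append((actor, action))
    return trajectory


if __name__ == "__main__":
    rng = random.Random(0)
    toy_student: Policy = lambda h: f"student_action_{len(h)}"
    toy_teacher: Policy = lambda h: f"teacher_action_{len(h)}"
    total = 5
    for step in range(total):
        p = teacher_prob(step, total)
        actors = [a for a, _ in mixed_rollout(toy_student, toy_teacher, 4, p, rng)]
        print(step, round(p, 2), actors)
```

Early in training the teacher generates most turns, so the student sees states close to the teacher's distribution. By the final step the rollout is fully on-policy.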
agents, distillation, multi-turn