Agents · arXiv cs.AI · 7 d ago

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

The paper introduces Sibling-Guided Credit Distillation (SGCD), a method for improving credit assignment when training long-horizon tool-use agents with reinforcement learning. SGCD dynamically samples mixed sibling rollouts and has an external LLM summarize their contrasting outcomes into a stepwise credit reference. This replaces direct token-level self-distillation, which can inadvertently amplify harmful behaviors. The reported gains are an AppWorld TGC score rising from 42.9 to 45.6 (+2.7) and pass@1 on the $\tau^3$-airline dataset rising from 0.583 to 0.602 (+0.019).
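The summary does not define "mixed sibling rollouts". One plausible reading is a group of rollouts for the same task in which some succeed and some fail, so that there is a contrast to summarize. The sketch below illustrates only that filtering step under this assumption; `Rollout`, `mixed_sibling_groups`, and `contrast_pairs` are hypothetical names, not the paper's code, and the LLM summarization step is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Rollout:
    """Hypothetical record of one agent attempt at a task."""
    task_id: str
    steps: List[str]   # tool calls or actions, in order
    success: bool      # final task outcome


def mixed_sibling_groups(rollouts: List[Rollout]) -> Dict[str, List[Rollout]]:
    """Group rollouts by task and keep only groups with both outcomes.

    Assumption: "mixed" means at least one success and at least one
    failure among siblings (rollouts sharing a task_id).
    """
    groups: Dict[str, List[Rollout]] = {}
    for r in rollouts:
        groups.setdefault(r.task_id, []).append(r)
    return {
        task_id: siblings
        for task_id, siblings in groups.items()
        if any(r.success for r in siblings)
        and not all(r.success for r in siblings)
    }


def contrast_pairs(siblings: List[Rollout]) -> List[Tuple[Rollout, Rollout]]:
    """Pair every successful sibling with every failed sibling."""
    wins = [r for r in siblings if r.success]
    losses = [r for r in siblings if not r.success]
    return [(w, l) for w in wins for l in losses]
```

In a pipeline along these lines, each (success, failure) pair would be the input handed to the external summarizer. Uniformly successful or uniformly failed groups are dropped because they offer no contrast.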

reinforcement-learning · tool-use · credit-distillation
