Agents · arXiv cs.AI · 8 d ago

APPO: Agentic Procedural Policy Optimization

The paper introduces Agentic Procedural Policy Optimization (APPO), a reinforcement learning method that aims to improve the multi-turn tool-use capabilities of large language model agents by refining where rollouts branch and how credit is assigned. APPO computes a Branching Score that combines token uncertainty with policy-induced likelihood gains and uses it to select branching locations, which is intended to give more effective exploration and finer-grained credit distribution across decision points. In the reported experiments, APPO outperforms existing agentic RL baselines by nearly 4 points across 13 benchmarks, with more efficient tool calls and more interpretable agent behavior.
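
The summary does not give the exact form of the Branching Score, so the following is only a rough sketch of the idea rather than the paper's formula. It assumes the score at each token position is the entropy of the next-token distribution plus a weighted log-likelihood gain of the current policy over an older reference policy. The names `token_entropy`, `branching_scores`, `top_k_positions`, and the `weight` parameter are hypothetical and chosen for illustration.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branching_scores(
    dists: Sequence[Sequence[float]],
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    weight: float = 1.0,
) -> List[float]:
    """Assumed score: entropy + weight * (log-prob gain of the new policy).

    dists[i]    : next-token distribution at position i
    logp_new[i] : log-prob of the sampled token under the current policy
    logp_old[i] : log-prob of the same token under the reference policy
    """
    if not (len(dists) == len(logp_new) == len(logp_old)):
        raise ValueError("all inputs must have one entry per token position")
    return [
        token_entropy(d) + weight * (new - old)
        for d, new, old in zip(dists, logp_new, logp_old)
    ]


def top_k_positions(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring positions, returned in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy three-token trajectory; the numbers are made up for illustration.
    dists = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
    logp_new = [-0.1, -0.5, -0.2]
    logp_old = [-0.1, -1.0, -0.2]
    scores = branching_scores(dists, logp_new, logp_old)
    print(top_k_positions(scores, k=1))
```

In this toy trajectory the second position is both the most uncertain and the one where the policy's likelihood has moved the most, so it would be picked as the branching point under these assumptions.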

reinforcement learning · tool-use · credit assignment
