Agents
APPO: Agentic Procedural Policy Optimization
The paper introduces Agentic Procedural Policy Optimization (APPO), a reinforcement learning method that improves the multi-turn tool-use capabilities of large language model agents by refining credit assignment and branching strategies. APPO selects branching locations with a Branching Score that combines token uncertainty with policy-induced likelihood gains, which supports more effective exploration and credit distribution across decision points. In the reported experiments, APPO outperforms existing agentic RL baselines by nearly 4 points across 13 benchmarks, while making more efficient use of tool calls and yielding more interpretable agent behavior, both of which matter to practitioners building RL systems for agents.
reinforcement learning, tool-use, credit assignment
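
To make the Branching Score idea concrete, here is a minimal sketch of how token uncertainty and a policy-induced likelihood gain could be combined to pick branching positions in a rollout. The summary does not give the paper's actual formula, so everything below is an assumption for illustration: uncertainty is taken to be the Shannon entropy of the next-token distribution, the likelihood gain is taken to be the difference in log-probability of the sampled token under the current and previous policy, and the two are mixed additively with a hypothetical `gain_weight`. The function names are also hypothetical.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branching_scores(
    token_probs: Sequence[Sequence[float]],
    new_logprobs: Sequence[float],
    old_logprobs: Sequence[float],
    gain_weight: float = 1.0,  # hypothetical mixing weight, not from the paper
) -> List[float]:
    """Score each position as entropy + gain_weight * likelihood gain.

    ASSUMPTION: the additive form and both components are illustrative
    stand-ins; the paper's Branching Score may be defined differently.
    """
    if not (len(token_probs) == len(new_logprobs) == len(old_logprobs)):
        raise ValueError("all inputs must have one entry per token position")
    scores = []
    for probs, new_lp, old_lp in zip(token_probs, new_logprobs, old_logprobs):
        uncertainty = token_entropy(probs)
        likelihood_gain = new_lp - old_lp  # > 0 if the current policy favors the token more
        scores.append(uncertainty + gain_weight * likelihood_gain)
    return scores


def select_branch_points(scores: Sequence[float], k: int) -> List[int]:
    """Return the k highest-scoring positions, in rollout order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Under these assumptions, positions where the policy is both uncertain and has recently shifted probability mass rank highest, so new rollouts would branch from those tokens rather than from fixed or random locations.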