Training
3SPO: State-Score-Supervised Policy Optimization for LLM Agents
The paper introduces State-Score-Supervised Policy Optimization (3SPO), a reinforcement learning algorithm for training large language models (LLMs) as autonomous agents. 3SPO performs post-step policy optimization using dynamic state-score supervision and requires no value function estimation. Compared with the GRPO baseline, it reports a 2.4x improvement in state exploration and 1.8x faster convergence, along with performance gains of 22.6% on ALFWorld and 15.6% on WebShop. The approach targets multi-turn agent settings, where it improves credit assignment and efficiency.
policy optimization, LLM, reinforcement learning
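
For context on the credit-assignment point, here is a minimal sketch of the group-relative advantage used by GRPO, the baseline named in the summary. It is not the 3SPO algorithm; the summary does not describe how state scores are computed. The function names, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(returns, eps=1e-8):
    """GRPO-style advantage: normalize each trajectory's return
    against the other trajectories sampled for the same task.

    No learned value function is involved; the group mean acts
    as the baseline. `eps` (assumed) guards against a zero spread.
    """
    mu = mean(returns)
    sigma = pstdev(returns)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in returns]


def per_step_advantages(advantages, steps_per_trajectory):
    """Copy each trajectory-level advantage to every step of that
    trajectory, so all steps in an episode share one signal."""
    return [[a] * n for a, n in zip(advantages, steps_per_trajectory)]


# Four hypothetical rollouts of one task: two succeed, two fail.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
step_adv = per_step_advantages(adv, [3, 5, 2, 4])
```

In this sketch every step of a multi-turn episode receives the same advantage, whether or not that step contributed to the outcome. This is the coarse credit assignment that a step-level signal, such as the state-score supervision the summary describes, is meant to refine.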