Training
GraphPO: Graph-based Policy Optimization for Reasoning Models
GraphPO is a reinforcement learning framework that represents the rollouts of a reasoning model as a directed acyclic graph, addressing limitations of standard Reinforcement Learning with Verifiable Rewards (RLVR). It merges semantically equivalent reasoning paths and reallocates computational resources, which improves inference efficiency and reduces the variance of advantage estimates. In the reported experiments, GraphPO outperforms existing chain- and tree-based methods across multiple benchmarks, which makes it relevant to practitioners working on the reasoning performance of large language models.
reinforcement-learning, policy-optimization, reasoning
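
The idea of merging equivalent reasoning paths into a graph can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the GraphPO paper: the `normalize` function is a hypothetical stand-in for a real semantic-equivalence check, nodes are keyed by (depth, normalized step) so that edges always point forward and the graph stays acyclic, and a node's value is simply the mean verifiable reward of all rollouts that pass through it.

```python
from collections import defaultdict

ROOT = (0, "<root>")  # hypothetical shared start node for all rollouts


def normalize(step: str) -> str:
    # Hypothetical stand-in for semantic equivalence: only case and
    # whitespace differences are treated as "the same step".
    return " ".join(step.lower().split())


def build_graph(rollouts):
    """Merge rollouts into a DAG.

    rollouts: list of (steps, reward) pairs, where steps is a list of strings.
    Returns (rewards, edges): the rewards seen at each node and the node
    adjacency sets. Nodes are (depth, normalized_step), so every edge goes
    from depth d to depth d + 1 and no cycle can form.
    """
    rewards = defaultdict(list)
    edges = defaultdict(set)
    for steps, reward in rollouts:
        prev = ROOT
        rewards[prev].append(reward)
        for depth, step in enumerate(steps, start=1):
            node = (depth, normalize(step))
            edges[prev].add(node)
            rewards[node].append(reward)
            prev = node
    return rewards, edges


def node_values(rewards):
    # Value of a node: mean reward over every rollout that visits it.
    return {node: sum(rs) / len(rs) for node, rs in rewards.items()}


def step_advantages(steps, values):
    # Per-step advantage: value of the node reached minus value of the
    # node it was reached from.
    advantages = []
    prev = ROOT
    for depth, step in enumerate(steps, start=1):
        node = (depth, normalize(step))
        advantages.append(values[node] - values[prev])
        prev = node
    return advantages


# Three toy rollouts; the second step of each is the same up to formatting.
rollouts = [
    (["x = 2 + 3", "x = 5", "answer 10"], 1.0),
    (["x = 3 + 2", "x = 5", "answer 10"], 1.0),
    (["x = 2 + 3", "X  =  5", "answer 11"], 0.0),
]
rewards, edges = build_graph(rollouts)
values = node_values(rewards)
advantages = step_advantages(rollouts[2][0], values)
```

In this sketch the three rollouts share a single "x = 5" node, so its value is averaged over three samples rather than estimated separately per chain or per tree branch. That pooling is one plausible reading of how merging could lower the variance of advantage estimates; the actual estimator used by GraphPO may differ.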