Baseline-Free Policy Optimization for Neural Combinatorial Optimization
The paper introduces Group Relative Policy Optimization (GRPO), a baseline-free algorithm for neural combinatorial optimization (NCO) that normalizes advantages within groups of sampled trajectories, addressing the instability of traditional REINFORCE training. Evaluated on Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) benchmarks, GRPO shows superior stability and reaches solution quality within 2% of the POMO baseline without requiring a frozen policy for variance reduction. For practitioners, this mitigates training collapse on complex routing problems and makes reinforcement learning for combinatorial optimization more robust.
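The group-wise advantage normalization described above can be illustrated with a minimal sketch. It assumes the common mean-and-standard-deviation form of the normalization, with reward defined as negative tour cost; the summary does not state the paper's exact formula, and the function names, the `eps` term, and the sample costs are hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(costs, eps=1e-8):
    """Normalize rewards within one group of sampled trajectories.

    Assumption: reward = -cost (shorter tours are better) and the
    advantage is (reward - group mean) / (group std + eps).
    """
    rewards = [-c for c in costs]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def surrogate_loss(log_probs, costs):
    """REINFORCE-style surrogate loss using the group-relative advantages.

    No learned critic or frozen baseline policy is involved: the group
    mean plays the role of the baseline.
    """
    advantages = group_relative_advantages(costs)
    weighted = [a * lp for a, lp in zip(advantages, log_probs)]
    return -sum(weighted) / len(costs)


# Hypothetical group of four tours sampled for a single TSP instance.
costs = [10.0, 12.0, 11.0, 13.0]
advantages = group_relative_advantages(costs)
```

Because the advantages are centered within each group, they sum to approximately zero, and a group whose trajectories all have the same cost contributes no gradient signal. In this sketch the shortest tour receives the largest advantage and the longest tour the smallest.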