Training · arXiv cs.AI · 21 h ago

Baseline-Free Policy Optimization for Neural Combinatorial Optimization

The paper introduces Group Relative Policy Optimization (GRPO), a baseline-free algorithm for neural combinatorial optimization (NCO). GRPO normalizes advantages within each group of sampled trajectories, which addresses the instability associated with traditional REINFORCE training. On Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) benchmarks, GRPO is reported to train more stably and to reach solution quality within 2% of the POMO baseline, without requiring a frozen policy for variance reduction. For practitioners, the approach mitigates training collapse on complex routing problems and makes reinforcement learning more robust for combinatorial optimization tasks.
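The group-wise normalization described in the summary can be illustrated with a minimal sketch. It assumes the commonly used mean-and-standard-deviation form of group-relative advantages, with reward defined as negative tour cost. The summary does not give the paper's exact formula, so the function names, the `eps` parameter, and the sign convention here are hypothetical.

```python
import math
from typing import List


def group_relative_advantages(costs: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of trajectories sampled for the same instance.

    Assumption: reward = -cost (shorter tours are better), and advantages are
    (reward - group mean) / (group std + eps). No learned or frozen baseline is used.
    """
    if not costs:
        raise ValueError("costs must contain at least one trajectory")
    rewards = [-c for c in costs]
    n = len(rewards)
    mean = sum(rewards) / n
    # Population standard deviation over the group.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def policy_gradient_surrogate(log_probs: List[float], advantages: List[float]) -> float:
    """REINFORCE-style surrogate loss: -mean(advantage * log-probability).

    Assumption: log_probs[i] is the total log-probability of trajectory i
    under the current policy; this is an illustrative helper, not the paper's loss.
    """
    if len(log_probs) != len(advantages):
        raise ValueError("log_probs and advantages must have the same length")
    n = len(log_probs)
    return -sum(a * lp for a, lp in zip(advantages, log_probs)) / n


# Three tours sampled for one TSP instance, with made-up lengths.
example_costs = [10.0, 12.0, 14.0]
example_advantages = group_relative_advantages(example_costs)
example_loss = policy_gradient_surrogate([-1.0, -2.0, -3.0], example_advantages)
```

In this sketch the shortest tour in a group receives a positive advantage and the longest a negative one, and the advantages sum to roughly zero within the group. If every tour in a group has the same cost, all advantages are zero and that group contributes no gradient.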

policy optimization · reinforcement learning · NCO