Training · arXiv cs.AI · 12 d ago

Beyond the Sampled Token: Preserving Candidate Support in RLVR

The paper introduces Candidate-aware Support Preservation (CaSP), a method for countering exploration collapse in reinforcement learning with verifiable rewards (RLVR). CaSP redistributes positive gradients among the top-$N$ candidates and imposes a stronger penalty on the top-$1$ candidate when it is incorrect. The paper reports improved performance across multiple benchmarks while exploration is preserved, with results covering models of up to 32 billion parameters and sampling budgets of $K=1024$. The method is aimed at practitioners who care about candidate diversity in RLVR.
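The summary does not give CaSP's actual update rule, so the following is only a minimal sketch of one possible reading of the two ideas it names, written as a per-token gradient on the logits. Everything specific here is an assumption for illustration: the function name `casp_like_logit_grad`, the parameters `top_n`, `share`, and `top1_penalty`, the choice to spread the shared mass in proportion to current probabilities, and the constant scaling of the top-1 penalty. It is not the paper's formula.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def casp_like_logit_grad(logits, sampled, advantage,
                         top_n=3, share=0.5, top1_penalty=2.0):
    """Ascent direction on the logits for one sampled token.

    Hypothetical sketch, not the paper's rule. `top_n`, `share`, and
    `top1_penalty` are illustrative knobs invented for this example.
    """
    probs = softmax(logits)
    # Candidate indices ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = order[:top_n]

    if advantage > 0:
        # Assumed redistribution: the sampled token keeps (1 - share) of the
        # positive target, and `share` is spread over the top-N candidates
        # in proportion to their current probabilities.
        mass = sum(probs[i] for i in top)
        target = [0.0] * len(probs)
        for i in top:
            target[i] += share * probs[i] / mass
        target[sampled] += 1.0 - share
        # Gradient of sum_i target_i * log p_i w.r.t. the logits is target - p.
        return [advantage * (t - p) for t, p in zip(target, probs)]

    # Non-positive advantage: the usual A * (onehot - p) gradient, scaled up
    # (assumed constant factor) when the penalized token is the top-1 candidate.
    scale = top1_penalty if sampled == order[0] else 1.0
    return [scale * advantage * ((1.0 if i == sampled else 0.0) - p)
            for i, p in enumerate(probs)]
```

Under these assumptions, a rewarded sample pushes the other top-$N$ candidates down less than the standard single-token update would, and an incorrect top-$1$ sample is pushed down harder than any other incorrect sample. Setting `share=0.0` and `top1_penalty=1.0` recovers the ordinary policy-gradient direction `advantage * (onehot - probs)`.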

reinforcement-learning · exploration · candidate-distribution
