Understanding Diversity Collapse in RLVR via the Lens of Overtraining
The paper formalizes "diversity collapse" in Reinforcement Learning with Verifiable Rewards (RLVR) and attributes it to overtraining: a regime in which updates no longer expand the model's capabilities but instead concentrate on already-favored trajectories. To mitigate this, the authors propose Bayesian Boundary Gating (BBG), which estimates the marginal contribution of each problem to the reasoning boundary. BBG improves performance across various reasoning benchmarks, particularly on Pass@$k$ at high $k$. For practitioners, the work offers a way to tune RLVR training that may lead to more robust reasoning in large language models.
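To make the role of high-$k$ Pass@$k$ concrete, the sketch below computes the widely used unbiased Pass@$k$ estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, for $n$ samples of which $c$ are correct. It is an illustration only: the summary does not say which estimator the paper uses, the function names `pass_at_k` and `mean_pass_at_k` are hypothetical, and the per-problem counts are invented to show how a collapsed model can score higher at $k=1$ yet lower at large $k$.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem.

    n: samples drawn, c: samples judged correct, k: attempt budget (k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset is not entirely incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average Pass@k over problems, each with n samples."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    n = 16
    # Invented counts of correct samples on two problems (illustration only).
    diverse = [8, 1]     # occasionally solves the harder problem
    collapsed = [16, 0]  # always solves one problem, never the other

    print(mean_pass_at_k(diverse, n, 1))     # 0.28125
    print(mean_pass_at_k(collapsed, n, 1))   # 0.5
    print(mean_pass_at_k(diverse, n, 16))    # 1.0
    print(mean_pass_at_k(collapsed, n, 16))  # 0.5
```

In this toy setting the collapsed model wins at $k=1$ but stays flat as $k$ grows, while the diverse model reaches full coverage at $k=16$. That gap at large $k$ is the quantity high-$k$ Pass@$k$ is sensitive to.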