Training
SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR
The paper shows that choosing the SFT checkpoint with the best pass@1 as the starting point for GRPO can be misleading, because overtrained checkpoints are prone to entropy collapse in their rollout distributions. It evaluates Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B across several SFT depths. For Qwen, pre-RL pass@1 improves with depth while performance after GRPO declines significantly, a rank inversion: the checkpoint ordering by pre-RL pass@1 does not carry over to the ordering after RL. The authors propose a diagnostic that combines pre-RL entropy with early GRPO monitoring to identify checkpoints at risk of failure, so practitioners can screen out such checkpoints instead of relying on pass@1 alone.
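The pre-RL entropy part of such a diagnostic can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the input layout (one next-token probability distribution per generated token), and the threshold are all assumptions made for illustration, and the summary does not state which entropy estimator or cutoff the paper uses.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical layout: a rollout is a list of next-token probability
# distributions, one per generated token.
Rollout = Sequence[Sequence[float]]


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_rollout_entropy(rollouts: Sequence[Rollout]) -> float:
    """Average per-token entropy over every token of every rollout."""
    entropies = [token_entropy(dist) for rollout in rollouts for dist in rollout]
    if not entropies:
        raise ValueError("no tokens to average over")
    return sum(entropies) / len(entropies)


def flag_at_risk(entropy_by_checkpoint: Dict[str, float], threshold: float) -> List[str]:
    """Return checkpoint names whose mean entropy falls below the threshold.

    The threshold is an assumed, user-chosen value, not one from the paper.
    """
    return sorted(
        name for name, h in entropy_by_checkpoint.items() if h < threshold
    )


# Toy example: two made-up checkpoints with hand-written distributions.
shallow = [[[0.5, 0.5], [0.25, 0.75]]]    # fairly spread-out sampling
deep = [[[0.99, 0.01], [1.0, 0.0]]]       # nearly deterministic sampling
scores = {
    "shallow": mean_rollout_entropy(shallow),
    "deep": mean_rollout_entropy(deep),
}
at_risk = flag_at_risk(scores, threshold=0.1)
```

In this toy setup the near-deterministic checkpoint is the one flagged. In practice the distributions would come from the model's softmax outputs on sampled rollouts, and a low pre-RL value would be a prompt to watch the early GRPO steps closely rather than a verdict on its own.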
sft, rank inversion, rl