Training
Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling
The paper presents Bebop, a systematic approach to integrating Multi-Token Prediction (MTP) into reinforcement learning (RL) pipelines, addressing the rollout-speed bottleneck in RL training. It shows that MTP acceptance rates are limited by fluctuations in model entropy and proposes a total variation (TV) loss function designed to optimize rejection sampling, achieving up to 95% acceptance rates and 1.8x acceleration in asynchronous RL training for the Qwen3.5, Qwen3.6, and Qwen3.7 models. This matters to practitioners because faster rollouts make RL training of large language models more efficient.
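To make the link between rejection sampling and total variation concrete, here is a minimal sketch of the standard speculative-sampling acceptance rule, not Bebop's implementation. It assumes a draft (MTP) distribution `q` and a target distribution `p` over the same small vocabulary; the function names and the toy distributions are hypothetical. Under this rule a drafted token is accepted with probability min(1, p/q), so the expected acceptance rate equals 1 - TV(p, q), which is presumably why a TV-based objective is a natural fit for raising acceptance rates.

```python
import random

def total_variation(p, q):
    """TV distance between two discrete distributions over the same vocabulary."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def expected_acceptance(p, q):
    """Probability that a token drafted from q is accepted against target p.

    Equals sum_x min(p(x), q(x)), which is 1 - TV(p, q).
    """
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def speculative_step(p, q, rng):
    """One rejection-sampling step; returns (token, accepted)."""
    # Draft a token from the (hypothetical) MTP head distribution q.
    draft = rng.choices(range(len(q)), weights=q)[0]
    # Accept with probability min(1, p(draft) / q(draft)).
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft, True
    # On rejection, resample from the residual max(p - q, 0);
    # rng.choices normalizes the weights, so the output still follows p.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    token = rng.choices(range(len(p)), weights=residual)[0]
    return token, False

if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # toy target distribution (assumed values)
    q = [0.4, 0.4, 0.2]  # toy draft distribution (assumed values)
    rng = random.Random(0)
    n = 20000
    accepted = sum(speculative_step(p, q, rng)[1] for _ in range(n))
    print("TV(p, q):", total_variation(p, q))
    print("analytic acceptance:", expected_acceptance(p, q))
    print("empirical acceptance:", accepted / n)
```

The empirical acceptance rate should track the analytic value closely, and it falls as the draft distribution drifts away from the target.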
reinforcement learning, llm, training