Training
Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning
The paper introduces Dropout-GRPO, which adapts Group Relative Policy Optimization (GRPO) to continuous latent reasoning models such as Coconut by using structured dropout to make the latent states stochastic. A single Bernoulli mask is held constant across the latent recurrence steps, which produces the diverse trajectories that GRPO needs to compare within a group. On the GSM8K benchmark, pass@1 improves from 27.29% to 29.01%, a gain of 1.72 percentage points. For practitioners, the method offers a theoretically justified and practical way to improve latent-reasoning LLMs during post-training.
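The mechanism summarized above can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation. The recurrence `latent_step`, the dropout rate, the group size, and the placeholder rewards are all hypothetical. The sketch also assumes that one mask is drawn per rollout, and that advantages are rewards standardized within the group using the population standard deviation (implementations differ on this detail).

```python
import random
import statistics


def sample_mask(dim, p_drop, rng):
    """Draw one inverted-dropout mask: 0 with probability p_drop, else 1/(1 - p_drop)."""
    keep = 1.0 - p_drop
    return [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(dim)]


def latent_step(h):
    """Hypothetical stand-in for one latent recurrence step (not the Coconut model)."""
    n = len(h)
    return [0.5 * h[i] + 0.5 * h[(i + 1) % n] for i in range(n)]


def rollout(h0, steps, mask):
    """Run the recurrence, applying the SAME mask after every latent step."""
    h = list(h0)
    for _ in range(steps):
        h = [m * x for m, x in zip(mask, latent_step(h))]
    return h


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


rng = random.Random(0)
h0 = [1.0, 2.0, 3.0, 4.0]

# Assumption: one mask per rollout, so rollouts of the same input can differ.
masks = [sample_mask(len(h0), p_drop=0.25, rng=rng) for _ in range(4)]
group = [rollout(h0, steps=3, mask=m) for m in masks]

# Placeholder rewards (e.g. 1.0 if the final answer were correct); not real results.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Holding the mask fixed inside `rollout` is what distinguishes this from resampling dropout at every step: each trajectory is then a deterministic function of its mask, and the randomness enters only through which mask was drawn.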
reinforcement learning, policy optimization