DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning
The paper introduces DRA-GRPO, a framework that extends Group Relative Policy Optimization (GRPO) for mathematical reasoning in post-trained LLMs. It targets a limitation of GRPO's reward signal: completions sampled for the same prompt are scored without regard to how diverse their reasoning paths are. Diversity-aware Reward Adjustment (DRA) recalibrates those rewards using Submodular Mutual Information (SMI) and Inverse Propensity Scoring (IPS), so that diverse reasoning paths are promoted. Empirically, DRA-GRPO reaches an average accuracy of 58.2% with the DeepSeek-R1-Distill-Qwen-1.5B model using only a small number of training samples, which suggests that accounting for diversity improves data-efficient alignment.
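To make the idea of a diversity-aware reward concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the redundancy term (sum of clamped cosine similarities to the other completions in the group), the `r / (1 + redundancy)` adjustment, the toy embeddings, and all function names are hypothetical choices. The IPS component is not modeled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def diversity_adjusted_rewards(rewards, embeddings):
    """Downweight each reward by how similar its completion is to the rest of the group.

    Assumption: redundancy is the sum of cosine similarities to all other
    completions, clamped at zero so the denominator stays >= 1.
    """
    adjusted = []
    for i, (reward, emb_i) in enumerate(zip(rewards, embeddings)):
        redundancy = sum(
            max(0.0, cosine(emb_i, emb_j))
            for j, emb_j in enumerate(embeddings)
            if j != i
        )
        adjusted.append(reward / (1.0 + redundancy))
    return adjusted

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Toy group of four completions for one prompt (made-up values).
# Completions 0 and 1 are correct and near-duplicates of each other,
# completion 2 is correct via a different path, completion 3 is wrong.
rewards = [1.0, 1.0, 1.0, 0.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

adjusted = diversity_adjusted_rewards(rewards, embeddings)
advantages = group_relative_advantages(adjusted)
```

In this toy group, the three correct completions would receive identical advantages under unadjusted rewards. After the adjustment, the completion with the distinct embedding keeps its full reward while the two near-duplicates are downweighted, so it ends up with the largest advantage.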
mathematical_reasoning, llm, reinforcement_learning