Research · arXiv cs.CL · 11 d ago

DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning

The paper introduces DRA-GRPO, a framework that extends Group Relative Policy Optimization (GRPO) for mathematical reasoning in post-trained LLMs by making the reward signal sensitive to the diversity of sampled reasoning paths. Its Diversity-aware Reward Adjustment (DRA) recalibrates rewards using Submodular Mutual Information (SMI) and Inverse Propensity Scoring (IPS), so that diverse reasoning paths are encouraged rather than redundant ones. Empirically, DRA-GRPO reaches an average accuracy of 58.2% with the DeepSeek-R1-Distill-Qwen-1.5B model using minimal training samples, which suggests that accounting for diversity improves data-efficient alignment.
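The idea of diversity-aware reward adjustment can be illustrated with a small sketch. The code below is not taken from the paper: it assumes that each sampled completion already has an embedding vector, approximates a completion's redundancy as the sum of its cosine similarities to the other completions in the group (a graph-cut style stand-in for SMI), and divides the reward by one plus that redundancy in the spirit of inverse propensity weighting. The function names, the clamp at zero, and the toy embeddings are all hypothetical.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def diversity_adjusted_rewards(rewards, embeddings):
    """Down-weight rewards of completions that resemble the rest of the group.

    Assumption (not from the source): redundancy is the summed cosine
    similarity to all other completions, clamped at 0 so the divisor is >= 1.
    """
    adjusted = []
    for i, (reward, emb_i) in enumerate(zip(rewards, embeddings)):
        redundancy = sum(
            cosine(emb_i, emb_j)
            for j, emb_j in enumerate(embeddings)
            if j != i
        )
        adjusted.append(reward / (1.0 + max(redundancy, 0.0)))
    return adjusted


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within the sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: three correct completions, two of which are near-duplicates.
rewards = [1.0, 1.0, 1.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

adjusted = diversity_adjusted_rewards(rewards, embeddings)
advantages = group_relative_advantages(adjusted)
print(adjusted)
print(advantages)
```

With unadjusted rewards, all three completions in this toy group would receive the same advantage (zero), so the group would contribute no learning signal. After the adjustment, the completion that differs from the other two ends up with the highest advantage.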

mathematical_reasoning · llm · reinforcement_learning · relevance 0.00 · engagement 0.00
