Research
A First-Principles Derivation of LLM Policy Optimization: From Expected Reward to GRPO and Its Structural Extensions
This article derives policy optimization for language models from first principles, starting from the objective \( J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)] \). It classifies policy gradient methods, including REINFORCE, PPO, and GRPO, along two axes: the trajectory probability \( p_\theta(\tau) \) and the reward \( R(\tau) \). For each method it explains the rationale behind the design choices, and it identifies compound failures that require joint modifications rather than a fix to a single component. The resulting unified framework helps practitioners diagnose and address limitations in existing LLM optimization algorithms and guides the development of more effective methods.
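As a brief illustration of how a gradient follows from this objective, the standard log-derivative (score-function) identity can be sketched as below. This is a textbook sketch rather than the article's own derivation. It assumes a discrete trajectory space and a reward \( R(\tau) \) that does not depend on \( \theta \); the per-step notation \( \pi_\theta(a_t \mid s_t) \) (token \( a_t \) given context \( s_t \), for a trajectory of length \( T \)) is an assumption introduced here and does not appear in the abstract.

\[
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \sum_{\tau} p_\theta(\tau)\, R(\tau)
   = \sum_{\tau} p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau) \\
  &= \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \big],
\qquad
\log p_\theta(\tau) = \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \text{const}.
\end{aligned}
\]

The constant collects terms that do not depend on \( \theta \), such as the prompt distribution. Both axes named above are visible in this form: one factor involves \( p_\theta(\tau) \) and the other \( R(\tau) \).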
policy optimization, LLM, reinforcement learning