Training
Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning
The paper introduces Cumulative Prefix-divergence Policy Optimization (CPPO), a reinforcement learning method for large language models (LLMs) that addresses the limitations of PPO-style trust-region methods, which apply one uniform constraint to every token. CPPO instead uses a token-level masking rule that combines a position-weighted threshold with a cumulative prefix budget, so that token-level deviations are controlled dynamically rather than with a single fixed bound. The authors report improved training stability and higher reasoning accuracy across model scales. For practitioners, the method offers a finer-grained way to account for autoregressive generation, where errors early in a sequence can compound over later tokens.
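The summary does not give CPPO's exact formulas, so the following is only an illustrative sketch of how a token-level mask might combine a position-weighted threshold with a cumulative prefix budget. It is not the paper's rule. The function name `prefix_budget_mask`, the parameters `base_eps`, `pos_slope`, and `budget`, the use of the absolute log-probability ratio as the per-token deviation, and the linear position weighting are all assumptions made for illustration.

```python
from typing import List


def prefix_budget_mask(
    log_ratios: List[float],
    base_eps: float = 0.2,
    pos_slope: float = 0.0,
    budget: float = 1.0,
) -> List[int]:
    """Return a 0/1 mask over tokens (1 = token contributes to the update).

    Hypothetical rule, NOT taken from the paper:
      * deviation_t  = |log pi_new(y_t) - log pi_old(y_t)|
      * threshold_t  = base_eps * (1 + pos_slope * t / (T - 1))
      * a token is kept only if deviation_t <= threshold_t AND the
        running sum of deviations over the prefix y_1..y_t is <= budget.
    """
    n = len(log_ratios)
    mask: List[int] = []
    cumulative = 0.0  # deviation accumulated over the prefix so far
    for t, log_ratio in enumerate(log_ratios):
        deviation = abs(log_ratio)
        cumulative += deviation
        # Relative position in [0, 1]; a single-token sequence gets 0.
        frac = t / (n - 1) if n > 1 else 0.0
        threshold = base_eps * (1.0 + pos_slope * frac)
        keep = deviation <= threshold and cumulative <= budget
        mask.append(1 if keep else 0)
    return mask


if __name__ == "__main__":
    # Per-token log-probability ratios for one sampled response (made-up values).
    ratios = [0.05, 0.3, 0.1, 0.1, 0.1]
    print(prefix_budget_mask(ratios, base_eps=0.2, pos_slope=0.0, budget=0.5))
```

In this sketch a token can be masked for two different reasons: its own deviation exceeds the threshold for its position, or the prefix leading up to it has already used up the budget. The second condition is what distinguishes a prefix-aware rule from a uniform per-token clip, since once the budget is spent every later token is masked regardless of its own deviation.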
reinforcement learning, policy optimization, LLM