Training · arXiv cs.AI · 21 h ago

Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

The paper introduces Cumulative Prefix-divergence Policy Optimization (CPPO), a reinforcement learning method for large language models (LLMs). It targets a limitation of existing PPO-style trust-region methods, which apply the same token-level trust region uniformly, and replaces that with a token-level masking rule. The rule combines a position-weighted threshold with a cumulative prefix budget to control token-level deviations dynamically. According to the summary, this improves training stability and reasoning accuracy across model scales. For practitioners, the appeal is a finer-grained way to account for autoregressive generation, where deviations early in a sequence can compound over later tokens.
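To make the two ingredients concrete, here is a minimal sketch of what a masking rule of this general shape could look like. It is not the paper's formula: the deviation measure (absolute log-ratio), the threshold schedule `base_eps / (1 + decay * t)`, the direction of that schedule, and every name and default value below are assumptions chosen only for illustration.

```python
from itertools import accumulate


def prefix_budget_mask(logp_new, logp_old, base_eps=0.2, decay=0.0, budget=1.0):
    """Return a 0/1 mask per token (1 = keep the token in the policy update).

    Hypothetical rule, NOT taken from the paper:
      * per-token deviation d_t = |log pi_new(t) - log pi_old(t)|
      * position-weighted threshold eps_t = base_eps / (1 + decay * t)
      * cumulative prefix deviation c_t = d_0 + ... + d_t
    A token is kept only if d_t <= eps_t and c_t <= budget.
    """
    if len(logp_new) != len(logp_old):
        raise ValueError("log-probability sequences must have equal length")

    # Per-token deviation between the new and the old policy.
    dev = [abs(n - o) for n, o in zip(logp_new, logp_old)]
    # Running sum of deviations over the prefix ending at each position.
    cum = list(accumulate(dev))

    mask = []
    for t, (d, c) in enumerate(zip(dev, cum)):
        eps_t = base_eps / (1.0 + decay * t)  # assumed schedule, for illustration
        mask.append(1 if (d <= eps_t and c <= budget) else 0)
    return mask


# Example: the third token deviates strongly, which also exhausts the
# prefix budget for the token that follows it.
example_mask = prefix_budget_mask(
    logp_new=[-1.0, -1.1, -2.0, -0.5],
    logp_old=[-1.0, -1.0, -1.0, -0.5],
)
```

In this sketch a token that is itself within its threshold can still be masked once the prefix before it has used up the budget, which is one way to read the "cumulative prefix" idea. How the actual method weights positions and accumulates divergence should be taken from the paper itself.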

reinforcement learning · policy optimization · LLM
