Training
LEPO: Latent Reasoning Policy Optimization for Large Language Models
The article introduces LEPO (Latent Reasoning Policy Optimization), a framework that applies reinforcement learning (RL) directly to the continuous latent representations of large language models (LLMs). LEPO injects controllable stochasticity through Gumbel-Softmax so that the model can follow diverse reasoning paths. During rollout, it maintains this stochasticity to sample diverse trajectories; in the optimization stage, it constructs a unified gradient estimation. The article reports significant performance improvements over existing RL methods on both discrete and latent reasoning tasks.
latent reasoning, llm, policy optimization
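As a rough illustration of the Gumbel-Softmax sampling mentioned in the summary, the sketch below draws a relaxed (soft) one-hot sample from a set of logits and mixes the rows of a toy embedding table into one continuous vector. This is not LEPO's implementation: the function names `gumbel_softmax` and `latent_embedding`, the temperature parameter `tau`, and the idea of forming the latent vector as a probability-weighted mix of embeddings are all assumptions made for the example.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw one relaxed one-hot sample from `logits` (illustrative helper).

    `tau` is the temperature: lower values push the sample towards a hard
    one-hot vector, higher values towards a uniform distribution.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def latent_embedding(probs, embedding_table):
    """Mix embedding rows by `probs` (an assumption, not taken from the article)."""
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]                      # toy next-token logits
    table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 3-token, 2-dim embeddings
    for _ in range(3):
        probs = gumbel_softmax(logits, tau=0.5)
        print(probs, latent_embedding(probs, table))
```

Each call perturbs the same logits with fresh noise, so repeated rollouts yield different latent vectors; that is the kind of trajectory diversity the summary describes. Passing a seeded `random.Random` instance as `rng` makes a sample reproducible.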