ai-digest.dev
Training · arXiv cs.AI · 8 d ago

LEPO: Latent Reasoning Policy Optimization for Large Language Models

The paper introduces LEPO (Latent Reasoning Policy Optimization), a framework that applies reinforcement learning (RL) directly to the continuous latent representations of large language models (LLMs). It adds controllable stochasticity through Gumbel-Softmax, which keeps rollouts stochastic so that diverse reasoning trajectories can be sampled. In the optimization stage, LEPO constructs a unified gradient estimate. The authors report significant performance improvements over existing RL methods on both discrete and latent reasoning tasks.
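For readers unfamiliar with Gumbel-Softmax, the following is a minimal sketch of the general sampling trick, not LEPO's actual implementation, which this summary does not describe. It assumes a latent reasoning step is formed by mixing token embeddings with the sampled soft weights; the names `gumbel_softmax`, `latent_embedding`, `tau`, and the toy `embedding_table` are all hypothetical.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw one relaxed (soft) one-hot sample from the categorical given by logits."""
    noisy = []
    for logit in logits:
        # Clamp u strictly inside (0, 1) so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + g) / tau)  # lower tau -> closer to one-hot
    # Numerically stable softmax over the perturbed logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def latent_embedding(weights, embedding_table):
    """Assumed mixing step: weighted sum of token embeddings -> one continuous vector."""
    dim = len(embedding_table[0])
    return [
        sum(w * row[d] for w, row in zip(weights, embedding_table))
        for d in range(dim)
    ]


# Toy setup (hypothetical values): 3 tokens, 2-dimensional embeddings.
rng = random.Random(0)
logits = [2.0, 1.0, 0.1]
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = gumbel_softmax(logits, tau=0.5, rng=rng)
latent = latent_embedding(weights, embedding_table)
```

Each call with fresh noise yields a different `weights` vector, which is the source of trajectory diversity during rollout; the temperature `tau` controls how close each sample is to a hard token choice.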

latent reasoning · llm · policy optimization
relevance 0.00 · engagement 0.00