Training · arXiv cs.AI · 14 d ago

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

The paper introduces the Independent Combinatorial Tokens (ICT) framework, which aims to improve Large Language Model (LLM) reasoning by addressing optimization instability in Reinforcement Learning with Verifiable Rewards (RLVR). ICT uses the Jensen-Shannon divergence computed from token logits to identify critical tokens and applies targeted updates to them, which the authors report improves exploration and training stability. Applied to the Qwen2.5 models (0.5B, 1.5B, and 7B), the method yields an average pass@4 improvement of 4.58% across diverse reasoning tasks.
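To make the token-selection idea concrete, here is a minimal sketch of ranking token positions by Jensen-Shannon divergence. It is not the paper's implementation: the summary does not say which two distributions ICT compares, so the sketch assumes two sets of logits over the same positions (for example, from two policies or checkpoints). The function names `js_divergence` and `top_divergent_positions` are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def top_divergent_positions(logits_a, logits_b, k):
    """Rank positions by JS divergence between two logit sets.

    ASSUMPTION: logits_a and logits_b hold per-position vocabulary logits
    from two sources being compared; the pairing is illustrative only.
    Returns (indices of the k most divergent positions, all scores).
    """
    scores = [
        js_divergence(softmax(a), softmax(b))
        for a, b in zip(logits_a, logits_b)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Toy example: 3 positions over a 3-token vocabulary.
logits_a = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
logits_b = [[2.0, 0.0, 0.0], [5.0, 0.0, 0.0], [1.0, 1.0, 0.1]]
critical, scores = top_divergent_positions(logits_a, logits_b, k=1)
```

In this toy input, position 1 is selected because its two distributions differ most, while position 0 scores zero because its logits are identical. How such positions would then feed into an RLVR update is not described in the summary and is left out here.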

llm · reasoning · token-distribution
