Training
Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs
The paper introduces dynamic epistemic entropy orchestrated erasable reinforcement learning ($\text{E}^3\text{RL}$), an approach to mitigating the autoregressive curse in large language models (LLMs) on long-horizon reasoning tasks. It uses the model's intrinsic epistemic uncertainty and segment-level adaptive thresholds to correct errors locally while keeping memory overhead linear. Trained on the DeepMath-103k dataset, the 4B and 8B parameter models surpass previous state-of-the-art results on mathematical reasoning benchmarks by 5.349% and 6.514%, respectively.
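The summary does not say how uncertainty is measured or how the segment-level thresholds are set. As a rough illustration only, the sketch below assumes that uncertainty is the Shannon entropy of each next-token distribution and that a segment is flagged when its mean entropy exceeds the mean plus `k` standard deviations over all segments. The function names, the fixed `segment_len`, and the `k` rule are hypothetical and are not taken from the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def flag_uncertain_segments(entropies, segment_len, k=1.0):
    """Return (indices of flagged segments, threshold).

    Assumption (not from the paper): the threshold adapts to the sequence as
    mean + k * std of the per-segment mean entropies.
    """
    # Split the per-token entropies into fixed-length segments.
    segments = [
        entropies[i:i + segment_len]
        for i in range(0, len(entropies), segment_len)
    ]
    means = [sum(seg) / len(seg) for seg in segments]

    # Population mean and standard deviation of the segment means.
    mu = sum(means) / len(means)
    sigma = math.sqrt(sum((m - mu) ** 2 for m in means) / len(means))
    threshold = mu + k * sigma

    flagged = [i for i, m in enumerate(means) if m > threshold]
    return flagged, threshold


# Toy sequence: eight tokens, with the third pair far more uncertain.
entropies = [0.1, 0.1, 0.1, 0.1, 2.0, 2.0, 0.1, 0.1]
flagged, threshold = flag_uncertain_segments(entropies, segment_len=2)
print(flagged)
```

In this toy run only the high-entropy pair of tokens is flagged, which is the kind of localized target an erasure-and-regeneration step could act on. How $\text{E}^3\text{RL}$ actually selects and erases segments is specified in the paper itself, not here.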
reinforcement learning, epistemic, llm