Training
STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability
The article introduces STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), an approach to policy entropy collapse in Reinforcement Learning with Verifiable Rewards (RLVR) algorithms such as GRPO. STARE uses a first-order gradient analysis to identify the token subsets that are critical to entropy change, then applies a closed-loop gate to regulate entropy, and it trains stably on models ranging from 1.5B to 32B parameters. Benchmarked against DAPO, STARE improves average accuracy by 4% to 8% across task families, which suggests it can improve the exploration-exploitation balance in RL training.
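The summary does not give STARE's actual gate or weighting function, so the sketch below is only an illustration of the general idea, not the authors' method. It assumes a standard first-order fact about softmax policies: raising the probability of a token whose surprisal (negative log-probability) is below the current entropy tends to lower entropy, and raising the probability of a token whose surprisal is above it tends to raise entropy. The function name `reweight_advantages`, the `strength` parameter, the linear gate, and the use of a single mean entropy as the threshold are all hypothetical choices made for this example.

```python
import math


def reweight_advantages(logprobs, advantages, entropy, target_entropy, strength=1.0):
    """Illustrative closed-loop reweighting of token-level advantages.

    Assumption (not from the source): the gate is a clipped, linear function
    of the gap between a target entropy and the measured mean entropy.
    """
    # Closed-loop signal: positive when entropy is below target (collapse risk),
    # negative when entropy is above target.
    gate = strength * (target_entropy - entropy) / target_entropy
    gate = max(-1.0, min(1.0, gate))

    reweighted = []
    for logprob, adv in zip(logprobs, advantages):
        surprisal = -logprob
        # First-order heuristic: does this token's update push entropy down or up?
        lowers_entropy = (adv > 0 and surprisal < entropy) or (adv < 0 and surprisal > entropy)
        raises_entropy = (adv > 0 and surprisal > entropy) or (adv < 0 and surprisal < entropy)

        weight = 1.0
        if gate > 0 and lowers_entropy:
            # Entropy too low: damp the tokens that would lower it further.
            weight = 1.0 - gate
        elif gate < 0 and raises_entropy:
            # Entropy too high: damp the tokens that would raise it further.
            weight = 1.0 + gate
        reweighted.append(weight * adv)
    return reweighted


if __name__ == "__main__":
    # Two sampled tokens with equal positive advantage: one confident, one surprising.
    logprobs = [math.log(0.9), math.log(0.1)]
    print(reweight_advantages(logprobs, [1.0, 1.0], entropy=0.2, target_entropy=0.4))
```

In this toy setting, when the measured entropy is below target, only the confident token's advantage is scaled down while the surprising token keeps its full advantage; when entropy equals the target, the gate is zero and the advantages pass through unchanged.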
llm, sampling, inference