Research
Distributional Biases in Post-Training: A Markovian Analysis of Reasoning Trajectories
This paper presents a Markovian analysis of reasoning trajectories in foundation models, focusing on how post-training strategies such as Reinforcement Learning with Verifiable Rewards (RLVR) and test-time scaling (TTS) shape them. The analysis indicates that these strategies tend to reinforce the reasoning paths a model already favors rather than expand the set of paths it explores, which calls into question how much exploration they actually provide. The authors propose that rejecting easier reasoning tasks and applying KL regularization can help preserve rare but important chains of thought (CoTs). These claims are supported by theoretical proofs and empirical simulations, with implications for improving task-specific reasoning in AI systems.
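The reinforcement effect described above can be illustrated with a toy sketch. This is not the authors' Markov model: it assumes a single-step policy over three hypothetical reasoning paths (one common correct path, one rare correct path, one incorrect path), a binary verifiable reward, and exact expected policy-gradient updates on softmax logits. The path probabilities, the `beta` value, and the function names are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def train(base_probs, rewards, beta, steps=2000, lr=0.5):
    """Exact gradient ascent on E_pi[r] - beta * KL(pi || base).

    beta = 0 gives the unregularized (reward-only) objective.
    The policy is a softmax over logits initialised at the base policy.
    """
    logits = [math.log(p) for p in base_probs]
    for _ in range(steps):
        pi = softmax(logits)
        # Regularized per-path reward: r_i - beta * log(pi_i / base_i)
        adj = [r - beta * math.log(p / b)
               for r, p, b in zip(rewards, pi, base_probs)]
        baseline = sum(p * a for p, a in zip(pi, adj))
        # Softmax policy gradient: d/d(logit_i) = pi_i * (adj_i - baseline)
        logits = [x + lr * p * (a - baseline)
                  for x, p, a in zip(logits, pi, adj)]
    return softmax(logits)


def rare_share(pi):
    """Share of the rare correct path among all correct probability mass."""
    return pi[1] / (pi[0] + pi[1])


# Hypothetical setup: [common correct, rare correct, incorrect]
base = [0.60, 0.05, 0.35]
rewards = [1.0, 1.0, 0.0]

no_kl = train(base, rewards, beta=0.0)
with_kl = train(base, rewards, beta=0.5)

print("rare share, base policy:", rare_share(base))
print("rare share, reward only:", rare_share(no_kl))
print("rare share, KL-regularized:", rare_share(with_kl))
```

In this sketch both correct paths earn the same reward, yet the reward-only update raises each logit in proportion to the path's current probability, so the common path pulls ahead and the rare path's share of correct mass shrinks. With the KL term, the stationary policy is proportional to the base policy times exp(reward / beta), which keeps the rare path's share among correct paths at its base value.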
reasoning, post-training, rl, models