Training
Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning
The paper introduces DiRL, a Direction-Aware Reinforcement Learning framework that aims to improve reasoning in large language models by distinguishing reasoning-driven exploration from memorization-driven exploration. DiRL derives direction-weighted gradient features from the model's representations and uses them to shape the reward, encouraging exploration that reflects genuine reasoning and discouraging exploration that reflects memorization. On mathematical and general reasoning benchmarks, the reported results show DiRL significantly outperforming existing exploration methods, which makes it a candidate approach for practitioners who want to strengthen LLM reasoning through reinforcement learning.
reinforcement-learning, llm, exploration