Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning
The paper introduces SWARR (Sliding-Window Attention with Reinforced Adaptation for Math Reasoning), a method that adapts sliding-window attention models to mathematical reasoning through a two-stage process of supervised fine-tuning and reinforcement learning. Experiments show that sliding-window attention initially underperforms full self-attention, but that on-policy reinforcement learning significantly improves its performance: the model trains on trajectories it samples itself, which align better with its architecture. For practitioners, the result is a more efficient alternative to full self-attention models that remains competitive in accuracy on long-context reasoning tasks.
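
To make the architectural constraint concrete, here is a minimal NumPy sketch of causal sliding-window attention for a single head. It is illustrative only and is not taken from the paper: the function names, the `window` parameter, and the single-head, unbatched layout are all assumptions. Each token attends to itself and at most `window - 1` preceding tokens, rather than to the full prefix as in full causal self-attention.

```python
import numpy as np


def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask: entry (i, j) is True if query i may attend to key j.

    Hypothetical helper: causal (j <= i) and limited to the last
    `window` positions (j > i - window).
    """
    i = np.arange(seq_len)[:, None]  # query positions, column vector
    j = np.arange(seq_len)[None, :]  # key positions, row vector
    return (j <= i) & (j > i - window)


def sliding_window_attention(q, k, v, window):
    """Single-head scaled dot-product attention with a sliding-window mask.

    q, k, v: arrays of shape (seq_len, d). Assumes window >= 1, so every
    row keeps at least its diagonal entry and the softmax is well defined.
    """
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = sliding_window_mask(seq_len, window)
    scores = np.where(mask, scores, -np.inf)  # block out-of-window keys
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `window` equal to the sequence length, the mask reduces to the ordinary causal mask, so full causal self-attention is the special case. Smaller windows cap the per-token attention cost at `window` keys, which is the efficiency gain the summary refers to. The cost is that earlier context is no longer directly visible to later tokens.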