Research
Rethinking the Role of Efficient Attention in Hybrid Architectures
The paper systematically analyzes hybrid architectures that combine full attention with efficient attention modules such as sliding-window attention (SWA) and recurrent sequence mixers. It finds that the choice of efficient attention module affects how quickly long-context capability emerges, while full attention remains crucial for long-range retrieval, which leads to a phenomenon the authors term Large-Window Laziness. The study also shows that applying NoPE to the full-attention layers of a small-window SWA hybrid can significantly improve long-context performance without compromising short-context capabilities, a result relevant to practitioners tuning architectures for a range of context lengths.
attention, architecture, scaling
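
To make the terms in the summary concrete, here is a minimal sketch of how a full causal mask differs from a sliding-window mask, and of what a hybrid layer stack with NoPE on the full-attention layers might look like. The function names, the interleaving ratio (`full_every`), and the use of RoPE on the SWA layers are illustrative assumptions; the summary does not specify them, and this is not the paper's implementation.

```python
def causal_mask(seq_len):
    """Full causal attention: query i may attend to every key j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Causal sliding-window attention: query i sees only the last
    `window` positions, i.e. keys j with i - window < j <= i."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


def hybrid_layout(num_layers, full_every, window):
    """Describe a hypothetical hybrid stack, one dict per layer.

    Assumption: every `full_every`-th layer is full attention with no
    positional encoding (NoPE); all other layers are SWA. Marking the
    SWA layers as RoPE is also an assumption made for illustration.
    """
    layers = []
    for idx in range(num_layers):
        if (idx + 1) % full_every == 0:
            layers.append({"kind": "full", "pos": "nope", "window": None})
        else:
            layers.append({"kind": "swa", "pos": "rope", "window": window})
    return layers


if __name__ == "__main__":
    # Print the layer plan for a small 8-layer example.
    for idx, layer in enumerate(hybrid_layout(num_layers=8, full_every=4, window=2)):
        print(idx, layer)
```

In this sketch only the full-attention layers can connect a query to arbitrarily distant keys, which is consistent with the summary's point that full attention carries long-range retrieval while the SWA layers handle local context.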