Research
HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization
HydraHead is an architecture that hybridizes Full Attention (FA) and Linear Attention (LA) along the head axis to reduce the quadratic cost of attention in long-context processing. Its two key components are an interpretability-driven strategy for selecting critical heads and a scale-normalized fusion module. Together they enable HydraHead to outperform existing hybrid models on long-context tasks with minimal training overhead. Trained on 15 billion tokens, HydraHead shows a 69% performance improvement at a 512K context length, indicating the potential of head-level hybridization for model efficiency and scalability.
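To make the idea of mixing attention types along the head axis concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the HydraHead implementation: the abstract does not specify the linear-attention kernel, the head-selection rule, or the fusion module. The sketch assumes that selected heads (the hypothetical `full_heads` set) keep causal softmax attention, that the remaining heads use a causal linear attention with an `elu(x) + 1` feature map, and that "scale-normalized fusion" can be approximated by RMS-normalizing each head's output before concatenation. All function names are hypothetical.

```python
import numpy as np

def full_attention(q, k, v):
    """Causal softmax attention for one head; O(T^2) in sequence length T."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)  # positions j > i
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feature_map(x):
    """elu(x) + 1, a positive feature map (an assumption, not from the source)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Causal linear attention for one head; O(T) in sequence length T."""
    q, k = feature_map(q), feature_map(k)
    kv = np.cumsum(k[:, :, None] * v[:, None, :], axis=0)  # running sum of k_i v_i^T
    z = np.cumsum(k, axis=0)                                # running sum of k_i
    num = np.einsum("td,tde->te", q, kv)
    den = np.einsum("td,td->t", q, z)[:, None]
    return num / den

def rms_normalize(x, eps=1e-6):
    """Rescale each row to unit RMS so FA and LA outputs share a common scale."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def hybrid_heads(q, k, v, full_heads):
    """q, k, v have shape (H, T, d); heads listed in full_heads use full attention."""
    outputs = []
    for h in range(q.shape[0]):
        attend = full_attention if h in full_heads else linear_attention
        outputs.append(rms_normalize(attend(q[h], k[h], v[h])))
    return np.concatenate(outputs, axis=-1)  # shape (T, H * d)

# Example: 4 heads, 8 tokens, head dimension 16; heads 0 and 2 keep full attention.
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 4, 8, 16))
out = hybrid_heads(q, k, v, full_heads={0, 2})
```

In this sketch only the heads in `full_heads` pay the quadratic cost, while the others carry a fixed-size running state. The per-head normalization is one plausible reading of why a fusion step is needed: softmax and linear attention outputs can differ in magnitude.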
attention, hybridization, architecture