Research
MiniMax Sparse Attention
The article introduces MiniMax Sparse Attention (MSA), a blockwise sparse attention mechanism for ultra-long-context large language models (LLMs). MSA builds on Grouped Query Attention (GQA) and adds a lightweight Index Branch for efficient Top-k selection. On a 109B-parameter model with a 1M-token context, it reduces per-token attention compute by 28.4x. The method is optimized for GPU execution and reaches a 14.2x prefill speedup and a 7.6x decoding speedup, which makes it relevant to practitioners who need scalable, efficient attention in LLMs.
attention, long-context, sparsity
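
To make the idea of blockwise Top-k selection concrete, here is a minimal pure-Python sketch for a single decoding query. It is not the MSA implementation: the summary does not describe how the Index Branch scores blocks, so the sketch assumes a stand-in score (the query's dot product with each block's mean key). The function names `block_scores`, `top_k_blocks`, and `sparse_attention` are hypothetical, and GQA head grouping and GPU kernels are left out.

```python
import math


def block_scores(query, keys, block_size):
    """Give each key block one score.

    ASSUMPTION: the score is the dot product of the query with the block's
    mean key. This is an illustrative stand-in for an index branch, not the
    scoring rule used by MSA.
    """
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(column) / len(block) for column in zip(*block)]
        scores.append(sum(q * m for q, m in zip(query, mean_key)))
    return scores


def top_k_blocks(scores, k):
    """Return the indices of the k highest-scoring blocks, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, block_size, k):
    """Softmax attention restricted to the tokens of the selected blocks."""
    selected = top_k_blocks(block_scores(query, keys, block_size), k)
    token_ids = [
        t
        for b in selected
        for t in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * x for q, x in zip(query, keys[t])) for t in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
        for d in range(dim)
    ]
    return output, selected
```

With 8 keys, a block size of 2, and `k=2`, the query attends to 4 of the 8 tokens, so the attention cost for that query scales with the number of selected blocks rather than with the full context length.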