Research
Understanding BigBird's Block Sparse Attention
BigBird introduces a block sparse attention mechanism that reduces the quadratic complexity of traditional attention to linear complexity. By utilizing a combination of global tokens, local windows, and block sparse attention patterns, BigBird can effectively handle sequences of length up to 8,192 tokens. This advancement enables practitioners to train models on longer sequences efficiently, making it particularly useful for tasks in natural language processing that require context from extensive input data.
bigbirdattention