Inference
Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
The article introduces Top-Theta (Top-$\theta$) Attention, a method that sparsifies transformer attention at inference time without retraining. It uses calibrated per-head thresholds to keep a constant number of significant elements per attention row. The approach reduces V-cache usage by 3-10x and cuts the number of attention elements by up to 10x, with at most 1% accuracy degradation across a range of natural language processing tasks. For practitioners who want to reduce the resource usage of transformer models, it is a practical alternative to traditional top-k attention.
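To make the thresholding idea concrete, here is a minimal sketch in plain Python of how a per-head threshold might be calibrated and then applied to one attention row. It is an illustration under stated assumptions, not the article's implementation: the function names `calibrate_threshold` and `top_theta_row` are hypothetical, the calibration rule (averaging the k-th largest score over sample rows) is an assumed stand-in, the scores are made-up numbers, and any compensation step the method applies after pruning is not modeled.

```python
import math

def calibrate_threshold(calib_rows, k):
    """Assumed calibration rule: average the k-th largest score over sample rows."""
    kth_largest = [sorted(row, reverse=True)[k - 1] for row in calib_rows]
    return sum(kth_largest) / len(kth_largest)

def top_theta_row(scores, theta):
    """Softmax over the entries of one attention row that reach the threshold."""
    kept = [i for i, s in enumerate(scores) if s >= theta]
    if not kept:
        # Fall back to the single largest score so the row stays a valid distribution.
        kept = [max(range(len(scores)), key=scores.__getitem__)]
    peak = max(scores[i] for i in kept)
    exps = {i: math.exp(scores[i] - peak) for i in kept}  # shifted for stability
    total = sum(exps.values())
    probs = [0.0] * len(scores)
    for i, e in exps.items():
        probs[i] = e / total
    return probs, kept

# Toy pre-softmax scores for a single head (made-up values).
calib_rows = [[2.0, 0.1, -1.0, 1.5], [1.8, -0.5, 0.2, 1.6]]
theta = calibrate_threshold(calib_rows, k=2)

# At inference, each score is compared against theta; no per-row sort is needed.
probs, kept = top_theta_row([1.9, 0.0, 1.7, -2.0], theta)
print(theta, kept, probs)
```

In this sketch only the positions in `kept` receive nonzero weight, so only the matching rows of V would need to be read, which is the kind of saving the summary attributes to the method. Unlike top-k, the inference path is an elementwise comparison rather than a selection over the whole row.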
transformers, sparsity, attention