ai-digest.dev
last updated 3 h ago
Inference · arXiv cs.AI · 12 d ago

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

The paper introduces Top-Theta (Top-$\theta$) Attention, a method for sparsifying transformer attention at inference time without retraining. It calibrates a threshold for each attention head so that every attention row retains a constant number of significant elements. The authors report a 3-10x reduction in V-cache usage and up to 10x fewer attention elements, with accuracy degrading by no more than 1% across a range of natural language processing tasks. For practitioners who want to reduce the memory and compute cost of attention, thresholding is a practical alternative to traditional top-k attention.
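To make the idea of a calibrated per-head threshold concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm: the summary does not describe the calibration procedure or the compensation step, so the rule used here (average the k-th largest score over a set of calibration rows) is an assumption, as is applying the threshold to pre-softmax scores. The names `calibrate_threshold` and `threshold_attention_row` are hypothetical.

```python
import math
import random


def calibrate_threshold(score_rows, k):
    """Return one threshold for a head from calibration rows.

    Assumed rule (not from the paper): take the k-th largest score in
    each calibration row and average those values.
    """
    kth_largest = [sorted(row, reverse=True)[k - 1] for row in score_rows]
    return sum(kth_largest) / len(kth_largest)


def threshold_attention_row(scores, theta):
    """Softmax over the scores that are >= theta; the rest get weight 0.

    The row maximum is always kept so that a row never ends up empty.
    Returns (weights, keep_mask).
    """
    row_max = max(scores)
    keep = [s >= theta or s == row_max for s in scores]
    # Subtract the row maximum before exponentiating for numerical stability.
    exps = [math.exp(s - row_max) if kept else 0.0
            for s, kept in zip(scores, keep)]
    total = sum(exps)
    return [e / total for e in exps], keep


# Toy usage: calibrate on random rows, then sparsify a fresh row.
random.seed(0)
calibration_rows = [[random.gauss(0.0, 1.0) for _ in range(128)]
                    for _ in range(256)]
theta = calibrate_threshold(calibration_rows, k=16)

new_row = [random.gauss(0.0, 1.0) for _ in range(128)]
weights, keep = threshold_attention_row(new_row, theta)
print("threshold:", theta, "kept elements:", sum(keep))
```

Unlike top-k, the thresholded row needs no per-row sort or selection at inference time: each score is compared against a fixed number. The trade-off is that the count of kept elements varies from row to row and only matches k on average over data similar to the calibration set.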

transformers · sparsity · attention · relevance 0.00 · engagement 0.00