Research · arXiv cs.AI · 8 d ago

Exact Linear Attention

The paper presents Exact Linear Attention (ELA), an attention mechanism for Transformers that reaches linear complexity in sequence length by exploiting properties of kernel functions. Constraints placed on the kernel are used to address gradient explosion and token attention dilution. On the engineering side, it introduces a Hyper-Link structure, a Memory Lobe module for bidirectional linear attention, and a routing-score-based bias mechanism for Mixture-of-Experts. The reported results are up to 6x faster decoding and a 75% reduction in KV cache usage while maintaining training performance. The method also extends to vision models through YOLO-LAT, which is reported to improve speed and parameter efficiency, so the work is relevant to practitioners scaling Transformers to long sequences and to efficient visual tasks.
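To make the linear-complexity claim concrete, here is a minimal NumPy sketch of generic kernelized linear attention. It is not the paper's ELA: the summary does not specify ELA's kernel or its constraints, so the `elu(x) + 1` feature map and all function names below are assumptions borrowed from the standard linear-attention formulation. The sketch shows that regrouping the matrix product avoids the n x n score matrix, and that the causal form needs only a fixed-size running state per step, which is the general reason linear attention can shrink the KV cache.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, a positive feature map. Assumption: stands in for
    # whatever kernel ELA actually uses, which the summary does not state.
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_attention(q, k, v):
    # Reference form: materialises the (n, n) kernel similarity matrix.
    scores = feature_map(q) @ feature_map(k).T
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v):
    # Same result, regrouped as phi(Q) @ (phi(K)^T @ V): cost is linear in n.
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v              # (d, d_v) summary of all keys and values
    z = phi_k.sum(axis=0)         # (d,) normaliser
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def causal_linear_attention(q, k, v):
    # Decoding form: each step updates a fixed-size state instead of
    # appending to a cache that grows with the sequence.
    phi_q, phi_k = feature_map(q), feature_map(k)
    state = np.zeros((q.shape[1], v.shape[1]))
    norm = np.zeros(q.shape[1])
    out = np.empty((q.shape[0], v.shape[1]))
    for t in range(q.shape[0]):
        state += np.outer(phi_k[t], v[t])
        norm += phi_k[t]
        out[t] = (phi_q[t] @ state) / (phi_q[t] @ norm)
    return out
```

In this sketch the bidirectional and causal variants differ only in whether the key/value summary is computed once over the whole sequence or accumulated token by token.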

transformers · attention · linear · kernel
