Inference
LoLA: Low-Rank Linear Attention With Sparse Caching
The paper introduces LoLA, a training-free augmentation to linear attention that improves associative recall, with lifelong in-context learning as a motivating use case. LoLA combines three forms of memory: a local sliding-window cache, a sparse global cache, and the recurrent hidden state of linear attention. On pass-key retrieval at a 4K context length, accuracy rises from 0.6% to 97.4% while the cache is 4.6x smaller than that of Llama-3.1 8B. For practitioners, the result suggests that long-context recall can be recovered in linear-attention models at a fraction of the memory cost of a full key-value cache, which matters for deployments that need long-term memory.
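The three-tier memory described above can be illustrated with a small sketch. This is not the paper's implementation: the class name `ThreeTierMemory`, the feature map `phi`, the exponential scoring of cached pairs, and the eviction rule (fold the pair the hidden state already recalls best, keep the hardest ones in the sparse cache) are all assumptions made here for illustration.

```python
import math
from collections import deque


def phi(x):
    """Hypothetical positive feature map (ELU(x) + 1); an assumption."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ThreeTierMemory:
    """Illustrative sketch: sliding window + sparse cache + linear state."""

    def __init__(self, dim_k, dim_v, window_size, sparse_size):
        self.window = deque()            # most recent (key, value) pairs
        self.sparse = []                 # pairs the state recalls poorly
        self.window_size = window_size
        self.sparse_size = sparse_size
        self.dim_v = dim_v
        # Linear-attention state: S accumulates phi(k) v^T, z accumulates phi(k).
        self.S = [[0.0] * dim_v for _ in range(dim_k)]
        self.z = [0.0] * dim_k

    def _recall(self, k):
        # Value the hidden state alone would return for key k.
        f = phi(k)
        den = dot(f, self.z)
        if den == 0.0:
            return [0.0] * self.dim_v
        return [sum(f[i] * self.S[i][j] for i in range(len(f))) / den
                for j in range(self.dim_v)]

    def _recall_error(self, k, v):
        # Squared error between the state's recall and the true value.
        return sum((p - t) ** 2 for p, t in zip(self._recall(k), v))

    def _fold_into_state(self, k, v):
        f = phi(k)
        for i, fi in enumerate(f):
            self.z[i] += fi
            for j, vj in enumerate(v):
                self.S[i][j] += fi * vj

    def append(self, k, v):
        self.window.append((k, v))
        if len(self.window) <= self.window_size:
            return
        # The oldest window entry becomes a candidate for the sparse cache.
        self.sparse.append(self.window.popleft())
        if len(self.sparse) > self.sparse_size:
            # Assumed rule: fold the easiest-to-recall pair into the state.
            idx = min(range(len(self.sparse)),
                      key=lambda i: self._recall_error(*self.sparse[i]))
            self._fold_into_state(*self.sparse.pop(idx))

    def read(self, q):
        # Combine the state's contribution with exactly stored pairs.
        fq = phi(q)
        num = [sum(fq[i] * self.S[i][j] for i in range(len(fq)))
               for j in range(self.dim_v)]
        den = dot(fq, self.z)
        for k, v in list(self.window) + self.sparse:
            w = math.exp(dot(q, k))      # assumed softmax-style weight
            den += w
            num = [n + w * vj for n, vj in zip(num, v)]
        if den == 0.0:
            return [0.0] * self.dim_v
        return [n / den for n in num]
```

With `window_size=2` and `sparse_size=1`, appending four pairs leaves two in the window, one in the sparse cache, and one folded into the recurrent state, so the exact cache stays at a fixed size however long the sequence grows.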
linear_attention, long_context