Inference · arXiv cs.AI · 2 d ago

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

FlashMemory-DeepSeek-V4 introduces Lookahead Sparse Attention (LSA), an inference method that reduces GPU memory usage for ultra-long contexts in large language models by predicting which context will be needed later and retaining only the essential key-value (KV) pairs. LSA is implemented with a backbone-free decoupled training strategy. Across various long-context benchmarks it reduces the average KV cache footprint by 13.5% while maintaining or slightly improving accuracy, and at the 500K-token scale it reduces KV cache overhead by over 90%. For practitioners, this means more efficient serving and lower memory requirements without a loss in accuracy on the reported benchmarks.
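To give a sense of scale for the 500K-token figure, here is a minimal sketch of the memory arithmetic and of a generic "keep the highest-scoring KV entries" selection. It is not the paper's implementation: the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements), the dense-cache formula, the 10% retention figure, and the names `kv_cache_bytes` and `keep_top_k` are all assumptions made for illustration, and the summary does not say how LSA scores or selects entries.

```python
from typing import List


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of a dense KV cache in bytes.

    The leading factor of 2 counts keys and values separately. This is the
    textbook formula for standard multi-head or grouped-query attention;
    it is an assumption here, not the layout used by the paper.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


def keep_top_k(scores: List[float], k: int) -> List[int]:
    """Return positions of the k highest-scoring cached tokens, in original order.

    A generic top-k retention rule used only as a stand-in; the scores would
    come from whatever predictor estimates future relevance.
    """
    if k <= 0:
        return []
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical model shape, chosen only to make the arithmetic concrete.
SHAPE = dict(layers=32, kv_heads=8, head_dim=128)

dense_bytes = kv_cache_bytes(tokens=500_000, **SHAPE)
# Assume 10% of entries are retained, i.e. a 90% reduction.
sparse_bytes = kv_cache_bytes(tokens=50_000, **SHAPE)
reduction = 1 - sparse_bytes / dense_bytes

# Toy relevance scores for four cached positions (made-up values).
kept_positions = keep_top_k([0.1, 0.9, 0.3, 0.7], k=2)
```

Under these assumed dimensions the dense cache at 500K tokens is about 65.5 GB and the retained cache about 6.6 GB. Real savings depend on the actual model configuration and on how many entries the method keeps at each context length.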

context · attention · llm · memory
