Daily digest — 2026-06-25

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

FlashMemory-DeepSeek-V4 introduces Lookahead Sparse Attention (LSA), an inference method that cuts GPU memory use for ultra-long contexts in large language models by predicting which context will be needed later and retaining only the essential key-value (KV) pairs. Implemented with a backbone-free decoupled training strategy, it reduces the average KV cache footprint by 13.5% across long-context benchmarks while maintaining or slightly improving accuracy; at the 500K-token scale, it reduces KV cache overhead by more than 90%. For practitioners, this means lower serving memory requirements without a reported loss in accuracy.

arXiv cs.AI · 16 d ago · found 14 d ago · Inference · 1 comment
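
To put the reported percentages in context, the sketch below works through plain dense KV cache arithmetic. It is not LSA itself and does not model how LSA selects entries. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical assumption chosen for illustration, and architectures that compress the KV cache would have a different per-token cost.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Size in bytes of a dense KV cache for one sequence.

    Each token stores one key and one value vector (the leading factor of 2)
    per layer and per KV head.
    """
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_elem


def after_reduction(dense_bytes, reduction):
    """Bytes left after removing a fraction `reduction` of the cache."""
    return dense_bytes * (1.0 - reduction)


# Hypothetical model shape, chosen only for illustration (not from the paper).
dense = kv_cache_bytes(seq_len=500_000, num_layers=32, num_kv_heads=8, head_dim=128)
sparse = after_reduction(dense, reduction=0.90)

print(f"dense cache:   {dense / 1e9:.1f} GB")
print(f"after 90% cut: {sparse / 1e9:.1f} GB")
```

Under these assumed dimensions, a 500K-token dense cache is roughly 65.5 GB per sequence, so a 90% reduction is the difference between needing several accelerators and fitting on one.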

OPRD: On-Policy Representation Distillation

The paper presents On-Policy Representation Distillation (OPRD), which extends on-policy distillation by aligning student and teacher representations in hidden-state space across selected layers, rather than relying solely on output probabilities. The method reduces sampling variance and improves training efficiency, with a 1.44x speedup and 54% lower memory usage than top-k on-policy distillation methods. The paper also reports stronger results on benchmarks such as AIME 2024/2025 and AIMO, which makes OPRD relevant to practitioners training large language models.

arXiv cs.AI · 16 d ago · found 14 d ago · Training · 0 comments
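
The idea of matching hidden states at selected layers can be sketched as a simple alignment loss. The version below is a minimal illustration, assuming mean squared error as the distance and equal hidden sizes for student and teacher; the paper's actual objective, layer selection, and any projection between differently sized models are not given in this summary. The function names and the `layer_pairs` argument are hypothetical. In an on-policy setup, both sets of hidden states would come from running the two models on sequences sampled from the student.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def representation_loss(student_hidden, teacher_hidden, layer_pairs):
    """Average per-token MSE over selected (student_layer, teacher_layer) pairs.

    student_hidden[layer] and teacher_hidden[layer] are lists of per-token
    hidden vectors (lists of floats) for the same sampled sequence.
    """
    total, count = 0.0, 0
    for s_layer, t_layer in layer_pairs:
        for s_vec, t_vec in zip(student_hidden[s_layer], teacher_hidden[t_layer]):
            total += mse(s_vec, t_vec)
            count += 1
    return total / count


# Toy hidden states: two tokens, hidden size 2 (illustrative values only).
student_hidden = {1: [[1.0, 2.0], [3.0, 4.0]]}
teacher_hidden = {3: [[1.0, 2.0], [3.0, 6.0]]}

# Align student layer 1 with teacher layer 3 (a hypothetical pairing).
loss = representation_loss(student_hidden, teacher_hidden, layer_pairs=[(1, 3)])
print(loss)
```

Because the loss compares full hidden vectors instead of sampled output tokens, it gives a dense training signal at every position, which is one intuition for the lower variance the summary describes.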

From Human Guidance to Autonomy: Agent Skill System for End-to-End LLM Deployment on Spatial NPUs

The paper presents a two-stage methodology for deploying Llama-3.2-1B and other decoder-only LLMs on AMD's XDNA 2 NPU, moving from human-guided development to an autonomous agent skill system. The initial Llama-3.2-1B deployment achieved a 2.2x prefill speedup and a 4.0x decode speedup over a hand-optimized baseline. The approach enables end-to-end deployment of multiple models with minimal human intervention, with competitive performance and functional generalization, which is relevant to practitioners optimizing LLMs for edge inference on resource-constrained hardware.

arXiv cs.AI · 16 d ago · found 14 d ago · Products
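
The separate prefill and decode speedups combine into an end-to-end figure that depends on how a request's time splits between the two phases. The sketch below shows that arithmetic; the baseline timings (1.1 s prefill, 4.0 s decode) are hypothetical values chosen for illustration and do not come from the paper.

```python
def combined_speedup(prefill_s, decode_s, prefill_speedup, decode_speedup):
    """End-to-end speedup when each phase is accelerated by its own factor."""
    baseline = prefill_s + decode_s
    optimized = prefill_s / prefill_speedup + decode_s / decode_speedup
    return baseline / optimized


# Hypothetical baseline timings for one request (not from the paper).
overall = combined_speedup(
    prefill_s=1.1,
    decode_s=4.0,
    prefill_speedup=2.2,  # reported prefill speedup
    decode_speedup=4.0,   # reported decode speedup
)
print(f"end-to-end speedup: {overall:.2f}x")
```

With these assumed timings the request is decode-heavy, so the overall figure lands near 3.4x, closer to the decode speedup than to the prefill one. A prompt-heavy workload would move it toward 2.2x.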

The day in AI, distilled.

