DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
The DeepSeek-V4 series introduces two Mixture-of-Experts (MoE) language models: DeepSeek-V4-Pro, with 1.6 trillion parameters (49 billion activated), and DeepSeek-V4-Flash, with 284 billion parameters (13 billion activated). Both can process contexts of up to one million tokens. The key architectural changes are a hybrid attention mechanism that combines Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA), and Manifold-Constrained Hyper-Connections (mHC); training additionally uses the Muon optimizer. Together, these choices improve efficiency and stability, and they substantially reduce inference FLOPs and KV cache usage in long-context scenarios, which makes the models well suited to large-scale, long-horizon tasks.
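To put the parameter counts quoted above in perspective, the short sketch below computes the share of each model's parameters that is activated per token. It uses only the figures stated in the summary; the dictionary layout and the helper name `activated_fraction` are illustrative assumptions, not part of any DeepSeek release.

```python
# Parameter counts as quoted in the summary above, in billions.
# The data layout and function name are illustrative assumptions.
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600.0, "activated_b": 49.0},
    "DeepSeek-V4-Flash": {"total_b": 284.0, "activated_b": 13.0},
}


def activated_fraction(total_b: float, activated_b: float) -> float:
    """Return the share of parameters activated per token, in [0, 1]."""
    if total_b <= 0:
        raise ValueError("total_b must be positive")
    return activated_b / total_b


if __name__ == "__main__":
    for name, p in MODELS.items():
        frac = activated_fraction(p["total_b"], p["activated_b"])
        print(f"{name}: {frac:.1%} of parameters activated per token")
    # DeepSeek-V4-Pro: 3.1% of parameters activated per token
    # DeepSeek-V4-Flash: 4.6% of parameters activated per token
```

Under these figures, both models route each token through only a few percent of their total parameters, which is the usual source of MoE compute savings relative to a dense model of the same size.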