Inference
Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design
The paper presents a CPU-GPU hybrid system for deploying large Mixture-of-Experts (MoE) models locally while meeting cloud-grade service level objectives (SLOs). Its key techniques are stream-loading prefill (SLP), which raises prefill throughput to 1,200 tokens/s and handles 32K-token prompts within 30 seconds, and distributed SLP (DSLP), which reaches 1,800 tokens/s for 45K-token prompts on two RTX 5090s. The approach runs inference locally at original precision with higher throughput and lower latency, which matters to practitioners who want to deploy high-performance models without relying on cloud infrastructure.
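As a rough sanity check on the figures quoted above, prefill time can be estimated as prompt length divided by throughput. The sketch below is only illustrative arithmetic, not the paper's method: the helper name `prefill_seconds` is hypothetical, "32K" and "45K" are assumed to mean 32,000 and 45,000 tokens, and the quoted throughput is assumed to be a sustained average over the whole prompt.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Estimate prefill time for a prompt at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


# Assumption: "32K" and "45K" are read as 32,000 and 45,000 tokens.
slp = prefill_seconds(32_000, 1_200)   # single-GPU SLP figure
dslp = prefill_seconds(45_000, 1_800)  # two-GPU DSLP figure

print(f"SLP:  {slp:.1f} s")   # SLP:  26.7 s
print(f"DSLP: {dslp:.1f} s")  # DSLP: 25.0 s
```

Under these assumptions both estimates come in under 30 seconds, which is consistent with the budget stated in the summary.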
Mixture-of-Experts, inference, cloud