Inference · arXiv cs.AI · 47 d ago

Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design

The paper presents a CPU-GPU hybrid system for running large Mixture-of-Experts (MoE) models locally while meeting cloud-grade service level objectives (SLOs). Its key techniques are stream-loading prefill (SLP), which raises prefill throughput to 1,200 tokens/s so that a 32K-token prompt is processed in about 30 seconds, and distributed SLP (DSLP), which reaches 1,800 tokens/s for 45K-token prompts on two RTX 5090s. The approach runs inference locally at the model's original precision, with higher throughput and lower latency, which matters for practitioners who want to deploy high-performance models without relying on cloud infrastructure.
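
As a rough sanity check on the figures above, prefill time is simply prompt length divided by throughput. The sketch below assumes that "32K" and "45K" mean 32,000 and 45,000 tokens and that throughput stays constant over the whole prompt; the function name `prefill_seconds` is illustrative and does not come from the paper.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Time to process a prompt at a constant prefill throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


# Figures quoted in the summary; token counts are assumed to be decimal thousands.
slp = prefill_seconds(32_000, 1_200)   # SLP
dslp = prefill_seconds(45_000, 1_800)  # DSLP on two RTX 5090s

print(f"SLP:  {slp:.1f} s")
print(f"DSLP: {dslp:.1f} s")
```

Under these assumptions both prompts finish prefill in under 30 seconds, which is consistent with the 30-second figure quoted for SLP.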

Mixture-of-Experts · inference · cloud · relevance 0.60 · engagement 0.00
