ai-digest.dev
last updated 5 h ago
Inference · arXiv cs.AI · 21 h ago

Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design

The paper presents a CPU-GPU hybrid system for deploying large Mixture-of-Experts (MoE) models locally while meeting cloud-grade service level objectives (SLOs). Its key innovations are stream-loading prefill (SLP), which reaches 1,200 tokens/s and handles 32K-token prompts within 30 seconds, and distributed SLP (DSLP), which reaches 1,800 tokens/s on 45K-token prompts using two RTX 5090s. The approach runs local inference at original precision, with higher throughput and lower latency, which matters to practitioners who want to deploy high-performance models without relying on cloud infrastructure.
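As a rough sanity check on the figures quoted above, the sketch below divides prompt length by throughput to estimate how long each prompt would take to process. It assumes "32K" and "45K" mean 32,000 and 45,000 tokens and that the quoted tokens/s apply uniformly across the whole prompt; the function name `prefill_seconds` is hypothetical and does not come from the paper.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Estimate processing time as prompt length divided by throughput."""
    return prompt_tokens / tokens_per_second


# Assumption: "32K" = 32,000 tokens and "45K" = 45,000 tokens.
slp = prefill_seconds(32_000, 1_200)   # single-GPU SLP figure from the summary
dslp = prefill_seconds(45_000, 1_800)  # two-GPU DSLP figure from the summary

print(f"SLP: {slp:.1f} s, DSLP: {dslp:.1f} s")
```

Under these assumptions both estimates come in below the 30-second mark mentioned in the summary, which is consistent with the quoted throughput numbers.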

Mixture-of-Experts · inference · cloud · relevance 0.00 · engagement 0.00