Inference · arXiv cs.AI · 7 d ago

Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning

Prism is a memory-centric framework for cost-efficient multi-LLM serving. It uses GPU memory ballooning to allocate memory elastically across models, which addresses the inefficiencies of existing sharing approaches and enables both spatial and temporal sharing that adapts to dynamic usage patterns. The kvcached balloon driver has been open-sourced and is already deployed on more than 10,000 GPUs, making it relevant to practitioners who want to improve resource utilization in LLM inference.
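
To make the idea of elastic allocation concrete, here is a minimal toy sketch of a shared memory pool from which co-located models grow and shrink their share on demand. It is an illustration only: `BalloonPool`, `grow`, `shrink`, and the page counts are hypothetical names and numbers, not the kvcached API or Prism's actual mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class BalloonPool:
    """Toy model of a shared GPU memory pool (hypothetical, not the kvcached API)."""

    total_pages: int
    held: dict = field(default_factory=dict)  # model name -> pages currently held

    def free_pages(self) -> int:
        # Pages not currently held by any model.
        return self.total_pages - sum(self.held.values())

    def grow(self, model: str, pages: int) -> int:
        """Give `model` up to `pages` more pages; return how many were granted."""
        granted = min(pages, self.free_pages())
        self.held[model] = self.held.get(model, 0) + granted
        return granted

    def shrink(self, model: str, pages: int) -> int:
        """Reclaim up to `pages` pages from `model`; return how many were reclaimed."""
        reclaimed = min(pages, self.held.get(model, 0))
        self.held[model] = self.held.get(model, 0) - reclaimed
        return reclaimed


pool = BalloonPool(total_pages=100)
pool.grow("model_a", 70)               # model_a is busy and takes most of the pool
granted_b = pool.grow("model_b", 50)   # only part of this request can be served
pool.shrink("model_a", 40)             # model_a goes idle, so pages are reclaimed
granted_b += pool.grow("model_b", 20)  # model_b picks up the freed capacity
```

In this sketch no model owns a fixed partition: capacity released by an idle model becomes immediately available to a busy one, which is the behaviour that elastic allocation is meant to provide.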

