Inference
Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints
The article presents an approach to optimizing inference for large language models (LLMs) that addresses endogenous memory growth: memory use rises as each request generates its output token by token. It introduces two algorithms, WAIT and Nested WAIT, which use a fluid model to guide GPU scheduling under memory constraints, particularly when output lengths vary. In simulations with the Llama-2-7B model on an A100 GPU, both algorithms showed improved stability and lower latency than existing methods, results that matter to practitioners seeking more efficient and cost-effective LLM deployments.
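To make the memory-growth problem concrete, here is a minimal sketch of how a batch's memory footprint changes during decoding. It is not the WAIT or Nested WAIT algorithm. It assumes memory is counted in cached tokens, that all requests in a batch start together, and that output lengths are known in advance, which a real scheduler would not know. The names `Request`, `memory_trace`, and `fits` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int  # prompt tokens, cached from the first step
    output_len: int  # tokens to generate (assumed known, for illustration only)


def memory_trace(batch):
    """Return the number of cached tokens held at each decode step."""
    steps = max((r.output_len for r in batch), default=0)
    trace = []
    for t in range(1, steps + 1):
        # A request still running at step t holds its prompt plus t generated
        # tokens; a finished request is assumed to have released its cache.
        used = sum(r.prompt_len + t for r in batch if r.output_len >= t)
        trace.append(used)
    return trace


def fits(batch, budget):
    """Check whether the batch's peak memory stays within the token budget."""
    return max(memory_trace(batch), default=0) <= budget


batch = [Request(prompt_len=4, output_len=2), Request(prompt_len=3, output_len=3)]
print(memory_trace(batch))  # [9, 11, 6]
print(fits(batch, budget=11), fits(batch, budget=10))  # True False
```

The trace rises while both requests are decoding and drops once the shorter one finishes, so a batch that fits at admission time can still exceed the budget a few steps later. This is why the scheduling decision has to account for future growth and not only for current usage.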
LLM, inference, scheduling