Inference
Service-Induced Congestion in Memory-Constrained LLM Serving
The article presents a discrete-time dynamical model of service-induced congestion in memory-constrained large language model (LLM) inference, where GPU memory use grows with the number of concurrent requests because each request's key-value (KV) cache grows as it is served. Under high concurrency, the model exhibits both eviction-free fixed points and limit cycles. The eviction-free equilibria are generally unstable, and the resulting throughput loss reaches up to 50% in homogeneous workloads. The findings point to workload heterogeneity and scheduling design as the levers for sustaining high throughput in LLM serving, which matters to practitioners managing resource allocation and performance in AI applications.
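The congestion mechanism described above can be illustrated with a small discrete-time simulation. This is a minimal sketch under stated assumptions, not the article's model: the function `simulate`, its parameters, the token-count memory accounting, and the "evict the newest request" policy are all hypothetical choices made for illustration. Every active request holds its prompt plus one extra KV-cache token per decode step, and a request evicted on overflow loses all of its decoded tokens.

```python
from dataclasses import dataclass


@dataclass
class Result:
    completed: int      # requests that finished decoding
    evictions: int      # requests dropped because memory overflowed
    wasted_tokens: int  # decoded tokens discarded by evictions


def simulate(capacity, prompt_len, output_len, max_concurrency, steps):
    """Toy discrete-time model of KV-cache growth under a memory cap.

    Assumptions (hypothetical, for illustration only): memory is counted
    in tokens, all requests are identical, and the newest request is
    evicted first when the cap is exceeded.
    """
    active = []  # decoded tokens per active request, oldest first
    completed = evictions = wasted = 0

    def used():
        # Each request holds its prompt plus everything decoded so far.
        return sum(prompt_len + g for g in active)

    for _ in range(steps):
        # Admit new requests while a slot is free and the prompt fits.
        while len(active) < max_concurrency and used() + prompt_len <= capacity:
            active.append(0)

        # One decode step: every active request's cache grows by one token.
        active = [g + 1 for g in active]

        # Service-induced congestion: serving itself can overflow memory.
        while active and used() > capacity:
            wasted += active.pop()  # newest request loses its progress
            evictions += 1

        # Retire requests that have produced their full output.
        completed += sum(1 for g in active if g >= output_len)
        active = [g for g in active if g < output_len]

    return Result(completed, evictions, wasted)


# Peak footprint per request is prompt_len + output_len = 20 tokens,
# so 5 concurrent requests fit in 100 tokens without ever evicting.
safe = simulate(capacity=100, prompt_len=10, output_len=10,
                max_concurrency=5, steps=100)
# Admitting by prompt size alone lets 10 requests in, then growth overflows.
greedy = simulate(capacity=100, prompt_len=10, output_len=10,
                  max_concurrency=10, steps=100)
print(safe)
print(greedy)
```

With these illustrative numbers the conservative cap never evicts, while the greedy cap admits requests that fit at arrival and then discards their work once the caches grow. The sketch shows only the mechanism; it does not attempt to reproduce the fixed points, limit cycles, or the 50% figure reported in the article.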
llm, memory, serving