Research · arXiv cs.AI · 11 d ago

The Price of Anarchy in Disaggregated Inference

This paper presents a game-theoretic analysis of disaggregated inference architectures, using NVIDIA Dynamo as a case study. It models the interaction between the prefill and decode GPU pools as three coupled games, shows how GPU saturation affects the Price of Anarchy (PoA), and proposes an adaptive controller that tunes routing parameters in real time. The reported results show significant improvements in PoA-hat and tail latency for the Nemotron-4-340B and Llama-3.1-70B models, particularly under saturated conditions, which underscores the value of adaptive strategies for resource allocation in LLM inference.
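For readers unfamiliar with the term, the Price of Anarchy is the ratio of the social cost at a selfish equilibrium to the cost of the best coordinated outcome. The sketch below works this out for a classic two-route (Pigou-style) toy game. It is not the paper's model: the two "pools", their latency functions, and all function names are hypothetical, chosen only to make the ratio concrete.

```python
# Toy Pigou-style routing game (illustrative only; NOT the paper's model).
# One unit of request traffic is split across two hypothetical pools:
#   pool A: latency equals its load x (it congests as traffic grows)
#   pool B: constant latency 1.0 regardless of load


def social_cost(x: float) -> float:
    """Average latency when a fraction x of traffic is routed to pool A."""
    return x * x + (1.0 - x) * 1.0


def equilibrium_split() -> float:
    """Selfish outcome: pool A's latency x never exceeds pool B's 1.0,
    so every request prefers A and all traffic ends up there."""
    return 1.0


def optimal_split(steps: int = 1000) -> float:
    """Coordinated outcome: grid search for the split with lowest average latency."""
    return min((i / steps for i in range(steps + 1)), key=social_cost)


def price_of_anarchy() -> float:
    """Ratio of equilibrium cost to optimal cost (1.0 means no loss)."""
    return social_cost(equilibrium_split()) / social_cost(optimal_split())


if __name__ == "__main__":
    print(f"optimal split to pool A: {optimal_split():.2f}")
    print(f"price of anarchy: {price_of_anarchy():.4f}")
```

In this toy game the selfish split sends all traffic to the congestible pool, while the coordinated optimum sends only half, giving a ratio of 4/3. An adaptive controller of the kind the summary describes would aim to push routing toward the coordinated split as load changes.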

inference · architecture · game-theory · agents
