Inference · arXiv cs.AI · 11 d ago

Towards Distributed Inference of LLMs on a P2P Network

This article presents a decentralized, prefix-cache-aware routing scheme for peer-to-peer (P2P) serving of large language models (LLMs), addressing the challenge of KV caches that are partitioned across nodes. The proposed method uses local radix trees to index cached prefixes and asynchronously updates each node's estimates of its peers' caches, reducing inference latency without central coordination. Evaluation on simulated MMLU workloads indicates that the approach reduces latency under optimal conditions, although network latency and distribution skew can diminish its advantages. Both are important considerations for practitioners deploying LLMs in a distributed setting.
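To make the routing idea concrete, here is a minimal Python sketch under stated assumptions; it is not the paper's implementation. A plain token-level trie stands in for a compressed radix tree, and the names `PrefixTrie`, `Peer`, `receive_update`, and `route` are hypothetical. So is the scoring rule (matched prefix length minus a network penalty expressed in token-equivalents), which is chosen only to illustrate why network latency can cancel out the benefit of a remote cache hit.

```python
from dataclasses import dataclass, field
from typing import Dict, Sequence


class PrefixTrie:
    """Token-level trie; a simplified stand-in for a compressed radix tree."""

    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, tokens: Sequence[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def longest_prefix(self, tokens: Sequence[int]) -> int:
        """Return how many leading tokens of `tokens` are present in the trie."""
        node, depth = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            depth += 1
        return depth


@dataclass
class Peer:
    name: str
    # Prefixes this node believes it holds in its own KV cache.
    local_cache: PrefixTrie = field(default_factory=PrefixTrie)
    # Possibly stale estimates of other peers' caches, filled asynchronously.
    peer_estimates: Dict[str, PrefixTrie] = field(default_factory=dict)

    def serve(self, tokens: Sequence[int]) -> None:
        """Record that this node processed (and cached) the given prompt."""
        self.local_cache.insert(tokens)

    def receive_update(self, sender: str, tokens: Sequence[int]) -> None:
        """Apply a cache announcement from another peer, whenever it arrives."""
        self.peer_estimates.setdefault(sender, PrefixTrie()).insert(tokens)

    def route(self, tokens: Sequence[int], net_penalty: Dict[str, float]) -> str:
        """Pick the node with the best (assumed) score, using local state only."""
        best_name = self.name
        best_score = float(self.local_cache.longest_prefix(tokens))
        for name, estimate in self.peer_estimates.items():
            # Assumed score: estimated prefix hit minus a network cost.
            score = estimate.longest_prefix(tokens) - net_penalty.get(name, 0.0)
            if score > best_score:
                best_name, best_score = name, score
        return best_name
```

For example, if peer `b` has served the prompt `[1, 2, 3, 4, 5]` and peer `a` has received the corresponding update, then `a.route([1, 2, 3, 4, 9], {"b": 2.0})` returns `"b"` (a four-token estimated hit outweighs the penalty), while a penalty of `10.0` makes `a` keep the request locally. The decision relies only on the routing node's own state, so a stale estimate can lead to a wasted forward; the sketch does not model that case.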

Tags: llm, decentralized, routing, cache
