Training · arXiv cs.CL · 8 d ago

Sub-Token Routing for KV Cache Compression

The paper introduces sub-token routing, a KV-cache compression technique that makes transformer inference more memory-efficient by compressing within retained tokens rather than only choosing which tokens to keep. It splits each retained value vector into groups and keeps only a subset of those groups, which improves the performance of token-level reduction in both large language models (LLMs) such as LLaMA-2-7B and multimodal models such as Qwen2.5-7B. The reported benefit is largest at small KV budgets, so the method complements existing token-level reduction methods and gives practitioners another way to cut inference memory in resource-constrained environments.
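The group-splitting idea in the summary can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names `sub_token_route` and `reconstruct` are hypothetical, and ranking groups by their L2 norm is an assumption made only for illustration, since the summary does not say how groups are selected.

```python
import numpy as np


def sub_token_route(values, num_groups, keep_groups):
    """Keep only `keep_groups` of `num_groups` channel groups per retained token.

    values: array of shape (num_tokens, head_dim) holding retained value vectors.
    Returns (kept, idx): kept has shape (num_tokens, keep_groups, group_dim),
    idx has shape (num_tokens, keep_groups) with the kept group indices.
    """
    num_tokens, head_dim = values.shape
    if head_dim % num_groups != 0:
        raise ValueError("head_dim must be divisible by num_groups")
    if not 0 < keep_groups <= num_groups:
        raise ValueError("keep_groups must be in 1..num_groups")

    groups = values.reshape(num_tokens, num_groups, head_dim // num_groups)

    # ASSUMPTION: score each group by its L2 norm; the paper may use a
    # different (possibly learned) routing criterion.
    scores = np.linalg.norm(groups, axis=-1)

    # Top-scoring groups per token, stored in ascending index order.
    top = np.argsort(-scores, axis=-1, kind="stable")[:, :keep_groups]
    idx = np.sort(top, axis=-1)

    rows = np.arange(num_tokens)[:, None]
    kept = groups[rows, idx]
    return kept, idx


def reconstruct(kept, idx, num_groups):
    """Rebuild full-width value vectors, filling dropped groups with zeros."""
    num_tokens, _, group_dim = kept.shape
    out = np.zeros((num_tokens, num_groups, group_dim), dtype=kept.dtype)
    rows = np.arange(num_tokens)[:, None]
    out[rows, idx] = kept
    return out.reshape(num_tokens, num_groups * group_dim)


# Two retained tokens, head_dim 8, split into 4 groups of 2 channels each.
values = np.array(
    [[3.0, 0.0, 0.0, 1.0, 2.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 5.0, 5.0, 1.0, 0.0, 0.0, 2.0]]
)
kept, idx = sub_token_route(values, num_groups=4, keep_groups=2)
approx = reconstruct(kept, idx, num_groups=4)
```

In this toy setup, keeping 2 of 4 groups halves the stored value entries per retained token, at the cost of a small per-token index array. This is the sense in which the technique can stack on top of token-level eviction.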

kv-cache · compression · llm · transformers
