Inference
PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression
PolyKV is a layer-wise KV cache optimization framework for large language models that replaces a single uniform cache policy with heterogeneous retention and allocation. For each layer, it dynamically selects a compression method and assigns a cache budget according to that layer's requirements. On models such as LLaMA-3.1-8B and Qwen3-8B, it recovers up to 54.5% of the performance gap compared to traditional single-policy methods. For practitioners, this means more efficient inference and lower memory cost in long-context scenarios.
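To make the idea of per-layer policies and budgets concrete, here is a minimal sketch of how such a plan could be built. It is not PolyKV's actual algorithm: the sensitivity scores, the policy names `attention_topk` and `sliding_window`, the `threshold` rule, and the function `plan_layers` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LayerPlan:
    layer: int
    policy: str  # hypothetical policy name, not taken from PolyKV
    budget: int  # number of KV entries this layer may retain


def plan_layers(sensitivity, total_budget, min_budget=1, threshold=0.5):
    """Split total_budget across layers in proportion to a per-layer score.

    `sensitivity` is an assumed, precomputed score per layer (higher means
    the layer suffers more from compression). How such scores would be
    obtained is outside the scope of this sketch.
    """
    n = len(sensitivity)
    total = sum(sensitivity)
    if n == 0 or total <= 0:
        raise ValueError("need at least one layer with positive sensitivity")
    if total_budget < n * min_budget:
        raise ValueError("total_budget too small for min_budget per layer")

    # Give every layer the floor, then share the rest proportionally.
    spare = total_budget - n * min_budget
    raw = [spare * s / total for s in sensitivity]
    budgets = [min_budget + int(r) for r in raw]

    # Hand out entries lost to rounding, largest fractional part first.
    leftover = total_budget - sum(budgets)
    by_fraction = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in by_fraction[:leftover]:
        budgets[i] += 1

    # Illustrative rule: sensitive layers get the score-based policy,
    # the others fall back to a cheaper recency-based one.
    peak = max(sensitivity)
    return [
        LayerPlan(
            layer=i,
            policy="attention_topk" if s / peak >= threshold else "sliding_window",
            budget=b,
        )
        for i, (s, b) in enumerate(zip(sensitivity, budgets))
    ]


if __name__ == "__main__":
    for plan in plan_layers([0.9, 0.3, 0.6, 0.2], total_budget=1000):
        print(plan)
```

The contrast with a uniform policy is that a single-policy baseline would assign every layer the same method and the same budget (here, 250 entries each), whereas this sketch shifts both toward the layers assumed to be most sensitive.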
kv-cache compression