Inference
ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models
ReasonAlloc is a framework for hierarchical KV cache budget allocation at decoding time in large language models. It targets the inference bottleneck caused by key-value (KV) cache growth during long chain-of-thought reasoning. The framework combines two stages: an offline, layer-wise preallocation strategy based on a "Reasoning Wave" demand pattern, and an online, head-wise reallocation strategy that adjusts resource use during inference. Evaluations on benchmarks such as MATH-500 and AIME 2024 show that ReasonAlloc significantly outperforms existing methods such as R-KV and SnapKV, particularly at smaller token budgets, which makes it relevant to practitioners seeking efficient LLM deployment.
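The summary does not say how either stage computes its budgets, so the following is only an illustrative sketch of the two-level idea, not ReasonAlloc's actual method. It assumes a precomputed per-layer demand profile (`layer_demand`) for the offline step and per-head importance scores (`head_scores`) for the online step, and it splits budgets proportionally; all of these names and the proportional rule are hypothetical.

```python
from typing import List


def proportional_split(total: int, weights: List[float]) -> List[int]:
    """Split an integer budget in proportion to non-negative weights.

    Largest-remainder rounding keeps the parts summing exactly to `total`.
    """
    if total < 0 or not weights or any(w < 0 for w in weights):
        raise ValueError("need total >= 0 and non-empty, non-negative weights")
    weight_sum = sum(weights)
    if weight_sum == 0:
        # No demand signal at all: fall back to a uniform split.
        weights = [1.0] * len(weights)
        weight_sum = float(len(weights))
    exact = [total * w / weight_sum for w in weights]
    parts = [int(x) for x in exact]  # floor, since all values are >= 0
    leftover = total - sum(parts)
    # Give the remaining tokens to the slots with the largest fractional parts.
    order = sorted(range(len(exact)), key=lambda i: exact[i] - parts[i], reverse=True)
    for i in order[:leftover]:
        parts[i] += 1
    return parts


def preallocate_layers(total_budget: int, layer_demand: List[float]) -> List[int]:
    """Offline step (assumed): fix per-layer budgets from a demand profile."""
    return proportional_split(total_budget, layer_demand)


def reallocate_heads(layer_budget: int, head_scores: List[float],
                     min_per_head: int = 0) -> List[int]:
    """Online step (assumed): redistribute one layer's budget across its heads."""
    reserved = min_per_head * len(head_scores)
    if reserved > layer_budget:
        raise ValueError("min_per_head exceeds the layer budget")
    extra = proportional_split(layer_budget - reserved, head_scores)
    return [min_per_head + e for e in extra]


# Hypothetical numbers: 4 layers, 1024 cached tokens in total.
layer_budgets = preallocate_layers(1024, [1.0, 3.0, 2.0, 2.0])
print(layer_budgets)  # [128, 384, 256, 256]

# Layer 1 has 3 heads; each keeps at least 16 tokens.
head_budgets = reallocate_heads(layer_budgets[1], [0.5, 0.25, 0.25], min_per_head=16)
print(head_budgets)  # [184, 100, 100]
```

In this sketch the layer budgets are computed once, while `reallocate_heads` could be called repeatedly during decoding as the head scores change; the per-head floor is one simple way to keep any head from being starved entirely.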
llm, kv cache, reasoning