Inference
OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference
The article introduces OBCache, a framework for pruning the key-value (KV) cache of large language models (LLMs) to make long-context inference more efficient. Drawing on Optimal Brain Damage theory, OBCache scores each token's saliency by how much pruning that token would perturb the attention output, so its eviction strategies take both attention weights and output information into account. In experiments on LLaMA and Qwen models, OBCache improves long-context accuracy over traditional heuristic eviction methods, which makes it relevant to practitioners who need to reduce KV cache memory usage in LLM applications.
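To make the idea of output-aware saliency concrete, here is a minimal pure-Python sketch for a single query. It is an illustration, not the scoring formula from the article: it assumes the saliency of a token is the exact change in the attention output when that token is evicted and the remaining weights are renormalized, which works out to a_j / (1 - a_j) * ||o - v_j||. The function names (`attention`, `output_aware_scores`, `evict`) and the `budget` parameter are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    # Scaled dot-product attention for one query vector.
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(logits)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

def output_aware_scores(query, keys, values):
    # Assumed score (illustrative only): the norm of the output change
    # caused by evicting token j and renormalizing the other weights,
    #   ||o' - o|| = a_j / (1 - a_j) * ||o - v_j||.
    weights, output = attention(query, keys, values)
    scores = []
    for w, v in zip(weights, values):
        if w >= 1.0:
            # Token carries all the attention mass; never evict it.
            scores.append(math.inf)
            continue
        dist = math.sqrt(sum((o - x) ** 2 for o, x in zip(output, v)))
        scores.append(w / (1.0 - w) * dist)
    return scores

def evict(query, keys, values, budget):
    # Keep the `budget` tokens whose removal would perturb the output most.
    scores = output_aware_scores(query, keys, values)
    ranked = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:budget])

query = [1.0, 0.0]
keys = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
scores = output_aware_scores(query, keys, values)
kept = evict(query, keys, values, budget=2)
```

Unlike a score built from attention weights alone, this one goes to zero for a token whose value vector already equals the attention output, however large its weight, because evicting it would not change the result.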
llm inference cache optimization