Inference
RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference
RKSC is a training-free inference framework that removes structural redundancy from multi-branch LLM reasoning pipelines. It combines three components: ASKS, which shares KV cache based on hidden-state cosine similarity; CGEE, which applies confidence-gated early exit; and RSBCM, which manages cache growth. Across five model families (7B-10B) and multiple benchmarks, RKSC achieves an average speedup of 3.008x over No-KV baselines. Because it requires no fine-tuning or architectural modifications, practitioners can apply it to existing models to improve inference efficiency.
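The two decision rules named above, similarity-based cache sharing and confidence-gated early exit, can be illustrated with a minimal sketch. This is not the RKSC implementation: the summary does not specify the exact criteria, so the function names, the assumption that sharing happens between reasoning branches, the use of the maximum softmax probability as the confidence signal, and the thresholds `0.95` and `0.9` are all hypothetical choices made for illustration.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_shareable_branch(
    hidden: Sequence[float],
    cached_hidden: Dict[str, Sequence[float]],
    threshold: float = 0.95,  # illustrative value, not taken from the source
) -> Optional[str]:
    """Return the id of the most similar cached branch, or None if none reaches the threshold.

    Assumption: a branch may reuse another branch's KV cache when their
    hidden states are sufficiently close in cosine similarity.
    """
    best_id: Optional[str] = None
    best_sim = threshold
    for branch_id, other in cached_hidden.items():
        sim = cosine_similarity(hidden, other)
        if sim >= best_sim:
            best_id, best_sim = branch_id, sim
    return best_id


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def should_exit_early(
    logits: Sequence[float],
    confidence_threshold: float = 0.9,  # illustrative value, not taken from the source
) -> bool:
    """Confidence gate: exit when the top token probability reaches the threshold.

    Assumption: confidence is measured as the maximum softmax probability.
    """
    return max(softmax(logits)) >= confidence_threshold


# Usage with toy 3-dimensional hidden states (hypothetical data).
cached = {"branch_a": [1.0, 0.0, 0.0], "branch_b": [0.0, 1.0, 0.0]}
reuse_from = find_shareable_branch([0.99, 0.05, 0.0], cached)
no_match = find_shareable_branch([1.0, 1.0, 0.0], cached)
confident = should_exit_early([10.0, 0.0, 0.0])
uncertain = should_exit_early([1.0, 1.0, 1.0])
```

In this toy setup the first query is close enough to `branch_a` to reuse its cache, while the second is equidistant from both cached branches and matches neither. A real system would operate on per-layer tensors rather than Python lists and would need a policy, such as the one RSBCM provides, for limiting how many cached entries are kept.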
KV cache, LLM, multi-step