Agents
UltraQuant: 4-bit KV Caching for Context-Heavy Agents
The paper presents UltraQuant, a 4-bit key-value (KV) caching approach designed for context-heavy agents, which significantly enhances performance in multi-turn workloads. Key technical innovations include TurboQuant-style rotation, asymmetric KV treatment, and optimized serving on AMD GPUs, achieving a 3.47x reduction in time-to-first-token during cache-pressured scenarios and a 1.63x increase in output throughput compared to the FP8 KV baseline. This advancement is crucial for practitioners as it addresses the challenges of long-context processing and resource utilization in concurrent AI systems.
kv-cachecompressioncontext-heavyturboquant