RAG
CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference
CacheWeaver is a method for Retrieval-Augmented Generation (RAG) that orders retrieved evidence in a cache-aware way to make prompts more efficient. It uses a prefix tree to prioritize the most reusable prefixes when ordering the retrieved evidence, achieving a 20-33% reduction in median time-to-first-token (TTFT) across various vLLM configurations while maintaining answer quality. For practitioners, it is a lightweight way to speed up inference without changing the underlying serving engine or the evidence sets.
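To make the idea of prefix-tree-guided ordering concrete, here is a minimal Python sketch. It is not the authors' implementation: the names `PrefixTrie`, `record`, and `order`, the use of evidence IDs as trie keys, and the greedy "follow the most frequently seen child" rule are all assumptions chosen for illustration. The sketch only reorders the retrieved evidence; it never adds or drops items.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Hypothetical node type: maps an evidence ID to the next node.
    children: dict = field(default_factory=dict)
    hits: int = 0  # number of recorded orderings that passed through this node


class PrefixTrie:
    """Illustrative prefix tree over previously served evidence orderings."""

    def __init__(self):
        self.root = TrieNode()

    def record(self, ordered_ids):
        # Remember an ordering that was actually sent to the serving engine.
        node = self.root
        for doc_id in ordered_ids:
            node = node.children.setdefault(doc_id, TrieNode())
            node.hits += 1

    def order(self, retrieved_ids):
        # Greedily walk the trie, at each step choosing the retrieved item
        # whose child node has been seen most often (assumed reuse proxy).
        remaining = list(retrieved_ids)
        ordered = []
        node = self.root
        while remaining:
            candidates = [d for d in remaining if d in node.children]
            if not candidates:
                break
            best = max(candidates, key=lambda d: node.children[d].hits)
            ordered.append(best)
            remaining.remove(best)
            node = node.children[best]
        # Items with no reusable prefix keep their original retrieval order.
        return ordered + remaining


trie = PrefixTrie()
trie.record(["a", "b", "c"])
trie.record(["a", "b", "d"])
reordered = trie.order(["d", "b", "a"])
```

In this toy run, the retrieved set `d, b, a` is rearranged so that it begins with the previously served prefix `a, b`. In a real deployment, the reuse signal would presumably come from what the serving engine actually has cached rather than from a simple hit counter.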
RAG, cache, inference