RAG
Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation
The paper introduces DICE (Document Inference via Chunk Evidence), an approach to long-document retrieval that targets dense-retrieval failures caused by early compression on the document side. DICE encodes document chunks independently and aggregates the chunk encodings into a single document vector. The authors report improved retrieval on long documents, with gains on benchmarks such as Dream and Needle for slices exceeding 4k tokens. The results suggest that better document-level encoding can mitigate evidence dilution, which makes the method relevant to practitioners who build retrieval systems over long-form content.
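To make the chunk-then-aggregate idea concrete, here is a minimal sketch of encoding a document chunk by chunk and pooling the results into one vector. It is not the paper's implementation: the summary does not say how DICE aggregates chunk evidence, so mean pooling followed by L2 normalization is an assumption. The hashed bag-of-words `toy_encode` is a hypothetical stand-in for a real dense encoder, and `DIM`, `chunk_size`, and all function names are illustrative.

```python
import hashlib
import math
from typing import List

DIM = 16  # toy embedding size (assumption, not from the paper)


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def toy_encode(tokens: List[str]) -> List[float]:
    """Hypothetical stand-in for a dense encoder: hashed bag-of-words, unit length."""
    vec = [0.0] * DIM
    for tok in tokens:
        bucket = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    return l2_normalize(vec)


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive, non-overlapping chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def encode_document(tokens: List[str], chunk_size: int = 4) -> List[float]:
    """Encode each chunk independently, then aggregate into one document vector.

    Aggregation here is mean pooling plus L2 normalization (an assumption).
    """
    chunks = split_into_chunks(tokens, chunk_size)
    if not chunks:
        return [0.0] * DIM
    chunk_vecs = [toy_encode(chunk) for chunk in chunks]
    mean = [sum(col) / len(chunk_vecs) for col in zip(*chunk_vecs)]
    return l2_normalize(mean)


def cosine(a: List[float], b: List[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    doc = "filler text about weather then the needle fact then more filler text".split()
    query = "needle fact".split()
    print(cosine(toy_encode(query), encode_document(doc)))
```

Because every chunk is normalized before pooling, each one contributes equally to the document vector regardless of its length. That is one simple way to keep a short passage of evidence from being swamped by surrounding text. The index still stores a single vector per document.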
retrieval, long-documents, evidence