RAG · arXiv cs.CL · 14 d ago

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

The paper introduces DICE (Document Inference via Chunk Evidence), an approach to long-document retrieval that targets a failure mode of dense retrieval: document-side early compression, which dilutes the evidence a long document contains. DICE encodes document chunks independently and then aggregates them into a single vector. The paper reports improved retrieval on long documents, with gains on benchmarks such as Dream and Needle for slices exceeding 4k tokens. The results suggest that improving document-level encoding can mitigate evidence dilution, which makes the method relevant to practitioners who run retrieval over long-form content.
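The encode-then-aggregate idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not specify the encoder, the chunking rule, or the aggregation operator, so `toy_encode` (a hashed bag-of-words stand-in for a real embedding model), the fixed-size token chunks, and the mean pooling are all assumptions made for illustration.

```python
import math
import zlib

DIM = 64  # toy embedding width, chosen arbitrarily for this sketch


def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def toy_encode(text):
    """Hypothetical stand-in encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    return l2_normalize(vec)


def split_into_chunks(text, chunk_size=8):
    """Split on whitespace into consecutive chunks of at most chunk_size tokens."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def encode_document(text, chunk_size=8):
    """Encode each chunk independently, then aggregate into one vector.

    Mean pooling is an assumed aggregation operator, not one confirmed
    by the source.
    """
    chunk_vecs = [toy_encode(c) for c in split_into_chunks(text, chunk_size)]
    if not chunk_vecs:
        return [0.0] * DIM
    pooled = [sum(col) / len(chunk_vecs) for col in zip(*chunk_vecs)]
    return l2_normalize(pooled)


def cosine(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    doc = "filler text " * 40 + "the needle is hidden in the final chunk"
    query = toy_encode("needle hidden")
    print(round(cosine(query, encode_document(doc)), 3))
```

Because each chunk is normalized before pooling, a short passage contributes as much to the document vector as any other chunk, rather than being outweighed by the token count of the rest of the document. A real system would swap `toy_encode` for a trained encoder and might use a learned or weighted aggregation instead of a plain mean.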

retrieval · long-documents · evidence · relevance 0.00 · engagement 0.00
