RAG
LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake
LakeQA is a newly introduced benchmark for search-centric question answering over a large data lake of approximately 9.5 TB of text drawn from diverse sources such as Wikipedia and open-source government data. Its tasks require long-horizon, multi-hop reasoning with implicit intermediate steps, and each task is annotated by Ph.D.-level experts to ensure quality. Experimental results indicate that even advanced models such as GPT-5.2 struggle with this benchmark, achieving only an 18.37% exact-match score, which highlights the need for stronger LLM capabilities in both search and reasoning before such systems are practical.
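For readers unfamiliar with the metric, the following is a minimal sketch of how an exact-match score is commonly computed. It is not LakeQA's official scorer: the normalization steps (lowercasing, stripping punctuation and English articles) and the function names `normalize` and `exact_match_score` are assumptions borrowed from common QA evaluation practice, and the benchmark's own rules may differ.

```python
import re
import string


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation and articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_score(predictions: list[str], references: list[str]) -> float:
    """Return the percentage of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


# Hypothetical answers, for illustration only (not taken from the benchmark).
score = exact_match_score(["The Eiffel Tower", "1990"], ["eiffel tower.", "1991"])
```

Under this definition an answer earns credit only if it matches the reference string exactly after normalization, so a partially correct multi-hop answer counts as a miss.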
question answering, search, data lake