RAG — AI news — AI News Digest

Mistral's new OCR model beats competitors in 72 percent of blind test cases, company says

Mistral AI has announced OCR 4, an optical character recognition model designed to extract text from document formats including PDFs, Word files, and PowerPoint presentations. According to the company, the model beat competitors in 72% of blind test cases; this is a company-reported result. Reliable text extraction matters to AI and LLM practitioners because it is the first step of many data-processing and analysis pipelines.

The Decoder · 33 d ago · found 12 d ago · #ocr #mistral #competitors

Quantifying Prior Dominance in RAG Systems

The paper introduces the Normalized Context Utilization (NCU) metric to evaluate Retrieval-Augmented Generation (RAG) systems, addressing limitations in current heuristics that conflate contextual information extraction with memory recall. It analyzes models ranging from 1.5B to 72B parameters, revealing that Small Language Models (SLMs) can outperform larger architectures in strict factual extraction, highlighting diminishing returns in scaling. The study emphasizes the importance of model architecture and alignment, noting significant issues with commercial APIs, including negative transfer and reliance on parametric priors over external evidence, which could inform practitioners about the effectiveness of different model sizes in RAG workflows.

arXiv cs.AI · 33 d ago · found 10 d ago · #rag #retrieval-augmented generation #contextual information

A Benchmark for Hallucination Detection in VLMs for Gastrointestinal Endoscopy

This study introduces the Gut-VLM dataset, a benchmark for hallucination detection in vision-language models (VLMs) specifically for gastrointestinal endoscopy, comprising 4,392 test VQA pairs evaluated across five models: MedGemma-4B, MedGemma-27B, LLaVA-Med-7B, LLaVA-v1.6-7B, and Lingshu-32B. The evaluation of nine detection methods reveals that ReXTrust, a white-box method, achieves the highest average AUC of 93.0 on MedGemma-4B, significantly outperforming alternatives, while highlighting the challenge of "confident confabulation" as a common failure mode. This benchmark is crucial for practitioners as it addresses the safety concerns of deploying VLMs in clinical settings, providing insights into effective detection strategies.

arXiv cs.AI · 33 d ago · found 10 d ago · #hallucination detection #vlms #gastrointestinal
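
The AUC reported above can be read as the probability that a detector scores a randomly chosen hallucinated answer higher than a randomly chosen faithful one. A minimal sketch of that Mann-Whitney estimate follows; the scores are made-up values, and the benchmark's own evaluation code may compute AUC differently (for example with a library ROC routine). Multiply by 100 to get the percentage-style figure used in the summary.

```python
def auc_from_scores(pos_scores, neg_scores):
    """ROC AUC as P(positive score > negative score); ties count as half a win."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical detector scores; higher means "more likely hallucinated".
hallucinated = [0.9, 0.8]  # positives
faithful = [0.1, 0.85]     # negatives
print(auc_from_scores(hallucinated, faithful))  # 0.75 (3 of 4 pairs ordered correctly)
```

The pairwise loop is quadratic, which is fine for a few thousand VQA pairs; a sort-based rank computation gives the same value faster on larger sets.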

RASC+: Retrieval-Constrained LLM Adjudication for Clinical Value Set Authoring

The article introduces a novel approach called Retrieval-Constrained LLM Adjudication for authoring clinical value sets, addressing limitations in zero-shot LLM generation for clinical code systems. By utilizing a Qwen3-based retrieval mechanism with vocabulary-aware expansion, the candidate-pool recall improved from 0.553 to 0.730, while the integration of GPT-5 for adjudication significantly enhanced macro F1 scores from 0.287 to 0.549. This method is crucial for practitioners as it demonstrates a reliable framework for improving the accuracy and safety of clinical code retrieval in quality measurement and decision support applications.

arXiv cs.AI · 33 d ago · found 10 d ago · #clinical_value_sets #retrieval_augmented
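
Candidate-pool recall and macro F1 are set-overlap metrics. The sketch below assumes each value set is a set of codes, that recall is measured on the retrieved candidate pool, and that macro F1 is the unweighted mean of per-value-set F1; the paper's exact definitions may differ, and the codes are illustrative only.

```python
def set_prf(predicted, gold):
    """Precision, recall, and F1 of a predicted code set against a gold code set."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(pairs):
    """Unweighted mean of per-value-set F1 over (predicted, gold) pairs."""
    return sum(set_prf(pred, gold)[2] for pred, gold in pairs) / len(pairs)


# Illustrative ICD-10-CM codes for a single hypothetical value set.
gold = {"E11.9", "E11.65", "E11.22"}
candidate_pool = {"E11.9", "E11.65", "E10.9", "R73.9"}  # retriever output
adjudicated = {"E11.9", "E11.65"}                        # subset kept by the LLM judge

pool_recall = set_prf(candidate_pool, gold)[1]  # 2 of 3 gold codes retrieved
final_macro_f1 = macro_f1([(adjudicated, gold)])
```

If the adjudicator can only keep codes from the pool, pool recall is an upper bound on final recall, which is why improving retrieval recall and improving adjudication precision are complementary.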

T2D-Bench: Evidence-Gated Evaluation of LLM Outputs for Type 2 Diabetes Using a Multi-Layer Clinical-Lifestyle Knowledge Graph

T2D-Bench is a new benchmark and evaluation framework designed to assess the compliance of large language model (LLM) outputs with explicit clinical guidelines for type 2 diabetes, utilizing a multi-layer clinical-lifestyle knowledge graph. It integrates biomedical data sources and ADA Standards of Care rules to evaluate LLM performance against 100 structured vignettes, revealing that baseline outputs from models like GPT-4o-mini and GPT-4 failed evidence-path checks in 35% and 33% of cases, respectively. This framework enables practitioners to identify and rectify unsupported clinical omissions in LLM-generated recommendations, enhancing their reliability in medical contexts.

arXiv cs.AI · 33 d ago · found 12 d ago · #LLM #benchmark #type 2 diabetes #evaluation

Poster: Exploring the Limits of Audio-Based Detection of Turkish Phone Call Scams

This research introduces the first public multi-modal dataset of 100 aligned audio-transcript pairs specifically for Turkish phone call scams, addressing the scarcity of annotated data in low-resource languages. The study evaluates seven large language models, including Gemini 2.5, GPT-4o, and Qwen, across different input conditions, finding that transcript-based inputs consistently outperform direct audio processing. This work underscores the necessity for culturally inclusive AI safety measures and more effective multi-modal systems in combating fraud in underrepresented languages.

arXiv cs.AI · 33 d ago · found 10 d ago · #scam-detection #audio #llm

Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity

The article presents a multi-agent framework for privacy-preserving Retrieval-Augmented Generation (RAG) that sanitizes retrieved content through semantic rewriting, effectively mitigating privacy leakage in sensitive applications. The framework employs three specialized agents for privacy extraction, semantic analysis, and reconstruction, achieving a reduction in targeted information exposure from 144 instances to just 1 in the LLaMA-3-8B model, while maintaining a BLEU-1 score of 0.122, surpassing the SAGE method. This approach introduces no additional latency for online inference, as the rewriting occurs in a one-time offline preprocessing step, making it practical for deployment in real-world scenarios.

arXiv cs.AI · 33 d ago · found 10 d ago · #privacy #semantic-rewriting #multi-agent

ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection

ReMMD introduces a framework for multimodal misinformation detection that addresses the limitations of existing benchmarks by incorporating realistic scenarios with multilingual narratives and multiple images. The framework includes ReMMDBench, a benchmark with 500 samples and various veracity and distortion labels, and ReMMD-Agent, which utilizes persistent memory to improve evidence verification and achieve a five-way veracity accuracy of 41.80% using GPT-5.2, while significantly reducing operational costs compared to previous agents. This advancement is crucial for practitioners as it enhances the detection of complex misinformation across diverse formats and languages.

arXiv cs.AI · 33 d ago · found 12 d ago · #multimodal #misinformation detection #verification #framework

MMed-Bench-IR: A Heterogeneous Benchmark for Multilingual Medical Information Retrieval

MMed-Bench-IR is a newly introduced benchmark for multilingual medical information retrieval, addressing the need for cross-lingual alignment, concept discrimination, and evidence retrieval across six languages. It comprises three tasks: cross-lingual medical QA retrieval with 6,127 queries based on the Unified Medical Language System, concept discrimination using 4,975 confusion sets, and multilingual evidence retrieval with 2,040 queries, all designed without overlap to accurately assess capabilities. The benchmark highlights significant performance gaps in biomedical encoders, with nDCG@10 scores dropping from 0.818 in English to 0.056 in Japanese, underscoring the limitations of existing English-only benchmarks for evaluating multilingual systems in clinical contexts.

arXiv cs.AI · 33 d ago · found 10 d ago · #multilingual retrieval #benchmark #medical information
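
nDCG@10 rewards placing relevant documents near the top of the first ten results and normalizes by the best achievable ordering, so 1.0 is a perfect ranking. A minimal sketch, assuming linear gains and the common log2(rank + 1) discount; some toolkits use exponential gains (2^rel - 1) instead, so absolute numbers can differ from the benchmark's.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount, ranks starting at 1."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids, qrels, k=10):
    """nDCG@k for one query; qrels maps doc id -> graded relevance (missing ids count as 0)."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal_dcg = dcg(sorted(qrels.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments: d1 and d3 are relevant, d2 is not.
qrels = {"d1": 1, "d3": 1}
print(round(ndcg_at_k(["d1", "d2", "d3"], qrels, k=10), 4))  # 0.9197
```

A benchmark-level score is the mean of this per-query value over all queries in a language, which is how an English-versus-Japanese gap like the one above would be compared.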

Mistral OCR 4 Brings Citation-Ready Structured Output to RAG, Agentic, and Enterprise Search Pipelines

Mistral AI released OCR 4, transitioning from basic text extraction to providing structured document outputs that include bounding boxes, typed classifications, and confidence scores at both page and word levels. This model supports 170 languages and operates within a single self-hosted container, enabling citation-ready inputs for retrieval-augmented generation (RAG), agentic, and enterprise search pipelines via a unified API endpoint. This enhancement is significant for practitioners as it facilitates more accurate and context-rich information retrieval in AI applications.

MarkTechPost · 33 d ago · found 12 d ago · #ocr #structured_output #mistral

MMGist: A Comprehensive Multimodal Benchmark for 2027

The article introduces MMGist, a new multimodal benchmark designed to address limitations in existing vision-language benchmarks. It comprises 7,262 curated items across seven capability dimensions, developed through a rigorous three-stage filtering process, and has been tested on 27 leading large vision-language models (LVLMs). MMGist demonstrates high fidelity in preserving model rankings (Spearman ρ = 0.98) while significantly reducing the number of evaluation items by 69% and enhancing cross-model discrimination by 78%, highlighting the importance of visual dependency and discriminative power in evaluating LVLM performance.

arXiv cs.AI · 34 d ago · found 15 d ago · #benchmark #multimodal #evaluation
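
Ranking preservation of this kind is measured with Spearman's ρ: the Pearson correlation between the two rank orderings of the models. A stdlib-only sketch with average ranks for ties; the model scores are hypothetical. The same statistic underlies the Spearman rank IC reported in the macro-factor forecasting item below.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    ss_x = sum((a - mx) ** 2 for a in rx)
    ss_y = sum((b - my) ** 2 for b in ry)
    return cov / (ss_x * ss_y) ** 0.5


# Hypothetical scores of four models on the full benchmark and on a reduced subset.
full = [71.2, 65.0, 58.3, 40.1]
subset = [69.8, 66.1, 55.0, 42.7]
print(spearman_rho(full, subset))  # 1.0: the subset preserves the ordering exactly
```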

Only Ask What You Don't Know: Grounded Delta Planning for Efficient Multi-step RAG

The article presents Grounded Delta Planning RAG (GDP-RAG), a novel framework for multi-hop question answering that enhances Retrieval-Augmented Generation (RAG) by focusing on information deltas. Key innovations include preliminary retrieval for grounding, a gap-conditioned planning prompt to target missing information, and a skeletal trajectory for subqueries, which collectively improve accuracy to 60.63% on benchmarks like HotpotQA while reducing computational costs significantly compared to existing methods. This approach is crucial for practitioners as it optimizes resource usage while enhancing the reliability of multi-step reasoning in AI applications.

arXiv cs.AI · 34 d ago · found 15 d ago · #multi-hop #question #answering

Leakage-Aware Benchmarking of LLM Forecasting: Real-Time Nowcasts as the Decision-Time Input for Macro Factor Ranking

The paper presents a leakage-aware benchmarking methodology for forecasting using a 7B parameter open-source retrieval-augmented LLM. It emphasizes decision-time constraints by utilizing only observable macroeconomic variables and a critic-actor architecture to rank equity factors, achieving a median Spearman rank IC of +0.154. This approach is significant for practitioners as it highlights the importance of avoiding information leakage in LLM forecasting and demonstrates the potential of combining LLMs with macroeconomic data for improved financial decision-making.

arXiv cs.AI · 34 d ago · found 15 d ago · #llm #forecasting #retrieval

When Confidence Takes the Wrong Path: Diagnosing Retrieval-State Lock-In in RAG

The paper introduces "retrieval-state lock-in," a failure mode of retrieval-augmented generation (RAG) systems in which a defective retrieval state produces stable, repeatable errors, so answer consistency is mistaken for correctness. The authors' diagnostic framework separates confidence into three components (answer surface, retrieved evidence, and retrieval state); in their ontology-guided knowledge-graph RAG (KG-RAG) system, 42% of errors show zero answer dispersion even though evidence checks flag problems. Requiring all three components to agree before accepting an answer yields 91.9% pooled precision, but at a steep cost in coverage: only 7.7% of answers are certified as low-risk.

arXiv cs.AI · 34 d ago · found 15 d ago · #retrieval-augmented generation #confidence #diagnosis
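
The acceptance rule described above can be sketched as a conjunction of three checks. The field names and the zero-dispersion threshold below are assumptions for illustration, not the authors' implementation; the point is that a stable but wrong answer (zero dispersion) is still rejected when the evidence or retrieval-state check fails, and that tightening the gate trades coverage for precision.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical per-question signals; the paper's actual checks may differ.
    answer_dispersion: float  # disagreement across resampled answers (0 = all identical)
    evidence_supported: bool  # does the retrieved evidence support the answer?
    retrieval_state_ok: bool  # does the retrieval state itself pass a sanity check?
    correct: bool             # ground truth, used only to score the gate


def certify(c, max_dispersion=0.0):
    """Accept only when all three signals agree that the answer is low-risk."""
    return c.answer_dispersion <= max_dispersion and c.evidence_supported and c.retrieval_state_ok


def precision_and_coverage(candidates):
    accepted = [c for c in candidates if certify(c)]
    coverage = len(accepted) / len(candidates)
    precision = sum(c.correct for c in accepted) / len(accepted) if accepted else float("nan")
    return precision, coverage


candidates = [
    Candidate(0.0, True, True, True),     # stable and supported: certified
    Candidate(0.0, False, False, False),  # stable but wrong (lock-in): rejected
    Candidate(0.3, True, True, True),     # answers disagree: rejected
    Candidate(0.0, True, True, False),    # passes every check yet wrong: certified
]
print(precision_and_coverage(candidates))  # (0.5, 0.5)
```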

Cross-lingual Retrieval-Augmented Classification for Dysarthria Severity Assessment

The article introduces Cross-lingual Retrieval-Augmented Classification (CRAC) for automatic dysarthria severity assessment, utilizing a novel align-retrieve-fuse pipeline to leverage speech data from different languages. By employing supervised contrastive learning to create a severity-focused embedding space and integrating top-k references via cross-attention during classification, CRAC achieves balanced accuracies of 87.3% on a Korean dataset and 86.7% on an Italian dataset, surpassing monolingual baselines by significant margins. This approach addresses the challenge of limited labeled pathological speech data, offering a promising method for practitioners dealing with dysarthria assessment across languages.

arXiv cs.AI · 34 d ago · found 15 d ago · #dysarthria #classification #cross-lingual #llm

Look Before You Zoom: Adaptive Routing for the Resolution-Context Trade-off in Visual RAG

The article introduces ViRGo (Visual Retrieval or Global Perception), a lightweight framework designed to optimize visual retrieval in Vision-Language Models (VLMs) by addressing the resolution-context trade-off. ViRGo dynamically assesses object scale using the VLM's localization heads and selects the most effective retrieval method—global perception, patch-based, or attention-based—based on the object's size, enhancing accuracy and reducing inference time across various Visual Question Answering (VQA) benchmarks. This adaptive routing approach allows practitioners to balance detail recovery for small targets with maintaining context for larger objects, improving overall model efficiency in real-time applications.

arXiv cs.CL · 34 d ago · found 10 d ago · #visual #rag #vlg

DoGMaTiQ: Automated Generation of Question-and-Answer Nuggets for Report Evaluation

DoGMaTiQ is a newly introduced pipeline designed for the automated generation of question-and-answer nuggets to evaluate long-form, citation-backed reports, particularly in cross-lingual contexts. The pipeline consists of three stages: document-grounded nugget generation, paraphrase clustering, and nugget subselection based on quality criteria, and it integrates with the AutoArgue framework for automatic report evaluation. Extensive experiments on TREC shared tasks demonstrate strong rank correlations with human evaluations, highlighting the importance of a robust LLM nugget generator in the evaluation process, with the code and artifacts made publicly available for further research.

arXiv cs.CL · 34 d ago · found 12 d ago · #evaluation #nuggets #report-evaluation

Predict the Retrieval! Test time adaptation for Retrieval Augmented Generation

The article introduces TTARAG, a test-time adaptation method for Retrieval-Augmented Generation (RAG) systems that dynamically updates model parameters during inference to enhance performance in specialized domains. By enabling the model to predict retrieved content, TTARAG addresses distribution shifts that typically hinder generalization, achieving significant performance gains across six specialized domains compared to baseline RAG systems. This advancement is crucial for practitioners seeking to optimize LLMs in niche applications where external knowledge integration is essential.

arXiv cs.CL · 34 d ago · found 12 d ago · #retrieval-augmented generation #test-time adaptation

Graph-Enhanced Large Language Models for Spatial Search

The paper discusses the integration of graph-enhanced techniques into Large Language Models (LLMs) to improve their spatial reasoning capabilities, which are crucial for domains like urban planning and civil engineering. It highlights the limitations of current LLMs in handling spatial data and proposes the use of graph structures to enhance reasoning over such data. This advancement is significant for practitioners as it could lead to more effective search engines and applications capable of addressing complex spatial queries.

arXiv cs.AI · 34 d ago · found 15 d ago · #llm #spatial-reasoning #graph #search

Data Selection Through Iterative Self-Filtering for Vision-Language Settings

The article presents a novel iterative self-filtering method for data selection in training vision-language models, leveraging a CLIP model. This approach dynamically refines the training dataset by balancing high-probability clean samples with diverse examples, resulting in improved downstream performance without requiring additional or pre-trained data. This method is significant for practitioners as it enhances model training efficiency and effectiveness in handling noisy datasets.

arXiv cs.AI · 34 d ago · found 15 d ago · #data selection #vision-language

Dissecting Agentic RAG: A Component Ablation for Multi-Hop QA with a Local 7B Model

The paper presents an ablation study of an agentic retrieval-augmented generation (RAG) pipeline for multi-hop question answering, run with a local 7-billion-parameter model (Qwen2.5-7B-Instruct). Compared with a single-pass dense-retrieval baseline (EM = 43.1%, F1 = 54.0%), the full agentic pipeline reaches EM = 53.2% and F1 = 61.6%. Key findings indicate that fixed hybrid retrieval outperforms adaptive routing and that two retrieval iterations capture most of the gains, suggesting that simpler, fixed strategies can beat complex adaptive approaches in resource-constrained environments.

arXiv cs.CL · 34 d ago · found 12 d ago · #qa #multi-hop #ablation study
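
EM and F1 in multi-hop QA are usually computed SQuAD-style: normalize both strings, then check exact equality (EM) or token overlap (F1). A minimal sketch of that convention; whether the paper's scorer normalizes in exactly this way is an assumption. When a question has several gold answers, the usual practice is to take the maximum score over them.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))  # 1.0
print(token_f1("Paris, France", "Paris"))  # partial credit: precision 0.5, recall 1.0
```

Dataset-level EM and F1 are the means of these per-question scores, reported as percentages.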

Point-in-Time Financial RAG with Frozen LLMs and Market-Feedback Adaptive Retrieval

The paper presents a novel approach to financial retrieval-augmented generation (RAG) systems that utilizes a frozen language model (LLM) and an adaptive retrieval mechanism based on Bayesian source memory. The method enhances predictive performance on a fixed dataset of 89 Nasdaq stocks, achieving a macro-F1 score improvement from 0.438 to 0.471 and a portfolio Sharpe ratio increase from 0.52 to 0.84 by incorporating market-context cards and feedback from residual-return signals. This work emphasizes the significance of optimizing retrieval strategies in financial applications, suggesting that effective evidence selection can be as critical as the reading capabilities of the model itself.

arXiv cs.CL · 34 d ago · found 12 d ago · #financial #retrieval-augmented-generation #market-feedback
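
The Sharpe ratio cited above is the mean excess return divided by the standard deviation of those excess returns. A minimal sketch, assuming per-period returns, a per-period risk-free rate, and optional square-root-of-time annualization; the summary does not state the paper's return frequency, risk-free rate, or annualization convention, and the return series is hypothetical.

```python
import statistics


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=1):
    """Mean excess return over its sample standard deviation, scaled by sqrt(periods_per_year)."""
    excess = [r - risk_free for r in returns]
    sd = statistics.stdev(excess)  # sample std (n - 1); requires at least two returns
    if sd == 0:
        raise ValueError("zero volatility: Sharpe ratio is undefined")
    return statistics.mean(excess) / sd * periods_per_year ** 0.5


# Hypothetical daily portfolio returns, annualized with 252 trading days.
daily_returns = [0.004, -0.002, 0.001, 0.003, -0.001]
annual_sharpe = sharpe_ratio(daily_returns, periods_per_year=252)
```

Because the ratio is computed on the same return series before and after a retrieval change, the risk-free and annualization choices cancel out of a like-for-like comparison, but they do affect the absolute numbers.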

Revisiting Text Ranking in Deep Research

The article presents a comprehensive evaluation of text ranking methods in deep research, focusing on the effectiveness of retrieval units, pipeline configurations, and query characteristics. Experiments conducted on the BrowseComp-Plus dataset involved two open-source agents, five retrievers, and three re-rankers, revealing that passage-level units are more efficient in constrained contexts, and that a proposed query-to-question (Q2Q) method enhances performance by mitigating query mismatches. This research is significant for practitioners as it clarifies the impact of various configurations on retrieval effectiveness, guiding the design of more efficient LLM-based search agents.

arXiv cs.AI · 34 d ago · found 14 d ago · #text ranking #deep research #llm

Retrievit: In-context Retrieval Capabilities of Transformers, State Space Models, and Hybrid Architectures

The paper investigates the in-context retrieval capabilities of Transformers, State Space Models (SSMs), and hybrid architectures, focusing on two tasks: n-gram retrieval and position retrieval. Hybrid models demonstrate superior data efficiency and extrapolation capabilities compared to SSMs and match or exceed Transformers in specific retrieval tasks, although Transformers retain an edge in position retrieval. The study highlights the differences in how these architectures learn positional associations, with SSMs developing locality-aware embeddings, while Transformers leverage causal attention and positional encodings for improved data efficiency.

arXiv cs.AI · 34 d ago · found 14 d ago · #retrieval #transformers #hybrid

DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent

The article announces the release of DeepResearch-9K, a challenging benchmark dataset designed for deep-research agents, featuring 9,000 multi-step questions across three difficulty levels, high-quality search trajectories from the Tongyi-DeepResearch-30B-A3B model, and verifiable answers. It also introduces the open-source framework DeepResearch-R1, which supports multi-turn web interactions and various reinforcement learning approaches. This release is significant for practitioners as it provides a robust dataset and framework to enhance the training and evaluation of deep-research agents, addressing the current limitations in available resources.

arXiv cs.AI · 34 d ago · found 14 d ago · #dataset #deep-research #qa

Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage

The article presents a systematic study on the relationship between retrieval quality and generation effectiveness in Retrieval-Augmented Generation (RAG) systems. It analyzes 15 text and 10 multimodal retrieval stacks across various RAG pipelines using two text benchmarks (TREC NeuCLIR 2024 and TREC RAG 2024) and one multimodal benchmark (WikiVideo), revealing strong correlations between retrieval metrics and the information coverage of generated responses. This research underscores the importance of aligning retrieval objectives with generation goals, suggesting that retrieval metrics can serve as reliable indicators for RAG performance, which is crucial for practitioners optimizing these systems.

arXiv cs.AI · 34 d ago · found 14 d ago · #retrieval-augmented generation #information coverage #metrics

From RAG to Agentic RAG for Faithful Islamic Question Answering

The article introduces IslamicFaithQA, a new generative benchmark for Islamic question answering consisting of 3,810 bilingual items designed to measure hallucination and abstention in responses. It presents an agentic Quran-grounding framework (agentic RAG) that incorporates structured tool calls for iterative evidence seeking, demonstrating significant performance improvements over standard RAG, particularly with the Qwen3 4B model. This work is crucial for practitioners as it provides a robust evaluation framework and resources that enhance the reliability of LLMs in sensitive applications like religious question answering.

arXiv cs.AI · 34 d ago · found 14 d ago · #llm #islamic #question-answering #benchmark

Retrieval-Augmented Anatomical Guidance for Text-to-CT Generation

The article presents a retrieval-augmented approach for Text-to-CT generation that enhances anatomical guidance by integrating semantic information from related clinical cases. Utilizing a 3D vision-language encoder and a text-conditioned latent diffusion model with a ControlNet branch, the method improves image fidelity and clinical consistency on the CT-RATE dataset, allowing for explicit spatial controllability. This technique addresses the limitations of existing models by combining semantic conditioning with anatomical plausibility, offering a scalable solution for volumetric medical image synthesis.

arXiv cs.AI · 34 d ago · found 14 d ago · #text-to-CT #generative models #anatomical guidance

Document Optimization for Black-Box Retrieval via Reinforcement Learning

The paper frames document expansion for black-box retrieval systems as a document optimization problem: a model that transforms documents is fine-tuned with reinforcement learning (GRPO), using improvements in retrieval rank as the reward. The method applies to various retriever types and yields significant gains; with the OpenAI text-embedding-3-small model, nDCG@5 rises from 58.7 to 66.8 on code retrieval and from 53.3 to 57.6 on visual document retrieval, even outperforming the larger text-embedding-3-large model. The findings suggest that learned document transformations can improve retrieval effectiveness, particularly for smaller models, and that combining this approach with retriever fine-tuning yields the best results, as demonstrated with Jina-ColBERT-V2.

arXiv cs.CL · 34 d ago · found 12 d ago · #retrieval #reinforcement-learning #document-optimization #llm
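
GRPO replaces a learned value baseline with group statistics: several outputs are sampled for the same input, and each reward is standardized against the group's mean and standard deviation. The sketch below shows only that advantage step, with a hypothetical reward equal to the number of rank positions the target document gains (1-based ranks, smaller is better). The paper's actual reward, and the clipped policy-gradient and KL terms of the full GRPO objective, are not shown.

```python
import statistics


def rank_improvement_reward(rank_before, rank_after):
    """Hypothetical reward: rank positions gained by the target document (positive = moved up)."""
    return rank_before - rank_after


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward against its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical rewrites of one document sampled from the policy; each tuple is
# (rank before rewriting, rank after rewriting) for the document's target query.
ranks = [(12, 3), (12, 12), (12, 20), (12, 5)]
rewards = [rank_improvement_reward(before, after) for before, after in ranks]  # [9, 0, -8, 7]
advantages = group_relative_advantages(rewards)
```

Rewrites that beat the group average get positive advantages and are reinforced; the unchanged and demoting rewrites are pushed down, without any separate critic network.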

DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

The article introduces the DataClaw0-9B model, which employs a two-stage pipeline that combines Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) to enhance data refinement and tailoring from large unstructured multimodal streams. It also presents the DataClaw0-val benchmark for evaluating data refinement capabilities, demonstrating effective performance in video generation, visual question answering (VQA), and GUI navigation. This advancement is significant for practitioners as it provides a method to improve model adaptability in scenarios with limited training data by generating high-information-density tailored datasets.

arXiv cs.AI · 34 d ago · found 16 d ago · #data_processing #multimodal

π-RAG: Oblivious Retrieval via Semantic Quantization and Transcendental Addressing for Large Language Models

The paper presents π-RAG, a new architecture designed for oblivious retrieval in Large Language Models (LLMs) that mitigates risks associated with sensitive data exposure. It employs the digits of π to create a transcendental addressing mechanism, which serves as an immutable layer between the LLM and private data, ensuring semantic understanding while maintaining privacy. The architecture incorporates a Semantic Quantization Layer that maps user inputs to Canonical Intent Centroids, enhancing security and compliance for applications in high-stakes sectors like finance and healthcare.

arXiv cs.AI · 34 d ago · found 16 d ago · #retrieval #llm #privacy

The Token Tax of Epistemic Accuracy: Comparing RAG and Long-Context Architectures for Document-Grounded Generative AI Applications

This study compares retrieval-augmented generation (RAG) and long-context prompting architectures for document-grounded generative AI, focusing on their impact on epistemic accuracy in high-stakes applications. Long-context prompting achieved a correctness rate of 73.1% compared to RAG's 65.4%, but incurred a significant "token tax," costing 26 times more per query due to increased input token consumption. These findings highlight the trade-offs between accuracy and resource efficiency, which are critical considerations for practitioners working with large language models in knowledge-intensive domains.

arXiv cs.AI · 34 d ago · found 16 d ago · #rag #accuracy #cost
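
The token-tax trade-off can be made concrete as cost per correct answer. Assuming the 26× per-query cost ratio holds uniformly and the correctness rates are per-query averages, let c be cost per query (with RAG's cost as the unit, so c_RAG = 1 and c_LC = 26) and a the correctness rate (a_LC = 0.731, a_RAG = 0.654):

```latex
\[
\frac{c_{\text{LC}}/a_{\text{LC}}}{c_{\text{RAG}}/a_{\text{RAG}}}
  = 26 \times \frac{0.654}{0.731} \approx 23.3,
\qquad
\frac{c_{\text{LC}} - c_{\text{RAG}}}{a_{\text{LC}} - a_{\text{RAG}}}
  = \frac{26 - 1}{0.731 - 0.654} = \frac{25}{0.077} \approx 325
\]
```

Under these assumptions, each correct answer costs roughly 23 times more with long-context prompting, and each additional correct answer gained by switching from RAG costs about 325 RAG queries' worth of tokens.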

Topic-to-Timestamp Alignment by Constrained Evidence Selection

The study presents a novel approach to topic-to-timestamp alignment in meeting transcripts, focusing on constrained temporal candidate selection rather than direct timestamp generation. By employing Mistral-7B-Instruct, the method improved Recall@5 from 31.9% to 50.0% and reduced mean absolute error (MAE) from 837.0 seconds to 761.0 seconds across 420 queries. This highlights the importance of retrieval quality and output design in enhancing temporal grounding in lengthy transcripts, offering valuable insights for practitioners developing LLMs for similar applications.

arXiv cs.CL · 34 d ago · found 13 d ago · #timestamp alignment #meeting transcripts #RAG
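
Recall@5 and MAE measure different things: whether the correct segment appears among the top five candidates, and how far the final timestamp lands from the gold time in seconds. A minimal sketch assuming one gold segment per query and timestamps in seconds; the segment ids and times are hypothetical, and the paper's exact scoring may differ.

```python
def recall_at_k(ranked_candidates, gold_segments, k=5):
    """Fraction of queries whose gold segment appears among the top-k candidates."""
    hits = sum(1 for cands, gold in zip(ranked_candidates, gold_segments) if gold in cands[:k])
    return hits / len(gold_segments)


def mean_absolute_error(predicted_seconds, gold_seconds):
    """Average absolute timestamp error, in seconds."""
    return sum(abs(p - g) for p, g in zip(predicted_seconds, gold_seconds)) / len(gold_seconds)


# Two hypothetical queries: the first gold segment is ranked 2nd, the second is ranked 6th.
ranked = [["s1", "s2"], ["s4", "s5", "s6", "s7", "s8", "s9"]]
print(recall_at_k(ranked, ["s2", "s9"], k=5))        # 0.5
print(mean_absolute_error([100, 250], [130, 200]))   # 40.0
```

Selecting among retrieved candidates mainly moves Recall@k, while MAE also depends on how the chosen candidate is turned into a single timestamp, which is why the two metrics can improve by different amounts.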

Fixed RAG Compression Collapses Measured Reader Scaling

The paper introduces ragscale, a toolkit for auditing reader scaling under Retrieval-Augmented Generation (RAG) compression, and shows that fixed compression can distort accuracy measurements and model rankings across readers and benchmarks. In particular, compression can help weak readers while hurting strong ones, a pattern the authors report across 20 readers and multiple QA and summarization benchmarks. The toolkit lets practitioners measure how compression affects model performance, addressing a critical gap in current RAG evaluation methods.

arXiv cs.CL · 34 d ago · found 13 d ago · #rag #compression #evaluation

HAKARI-Bench: A Lightweight Benchmark for Comparing Retrieval Architectures and Efficiency Settings under Unified Conditions

HAKARI-Bench is a newly released lightweight benchmark designed for evaluating retrieval architectures and efficiency settings under unified conditions, featuring 35 benchmarks and 551 tasks across 43 languages. It allows for model-agnostic comparisons of five retrieval families (BM25, dense, sparse, late interaction, rerankers) and their efficiency variants, achieving a high correlation with established benchmarks (Spearman >0.97) across 55 models. This tool is significant for practitioners as it facilitates rapid model selection and regression detection without the overhead of comprehensive evaluations, thus aiding in the optimization of retrieval-augmented generation and semantic search systems.

arXiv cs.CL · 34 d ago · found 13 d ago · #retrieval #benchmark #llm

Ghost Vectors: Soft-Deleted Embeddings Remain Reconstructible in HNSW Vector Databases

The paper shows that soft-deleted embeddings in HNSW vector databases remain recoverable after deletion requests, which raises compliance issues under data protection regulations such as GDPR and HIPAA. Using the Vec2Text inversion model, the authors recover substantial amounts of sensitive data from real-world datasets, including 100% recovery of structured data such as patient demographics. As a mitigation, they propose Epoch Key Rotation, an encryption-based method that renders deleted vectors unrecoverable; it completes in 2.5 ms for 500 records and provides cryptographic proof of deletion, which matters to practitioners responsible for data privacy and compliance in AI applications.

arXiv cs.AI · 34 d ago · found 20 d ago · #rag #data-erasure #vector-databases
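
Why soft deletion leaves data behind can be seen in a toy index. The class below is hypothetical and uses a flat dictionary instead of an HNSW graph, but its deletion works like a tombstone: the id is excluded from search results while the stored vector remains readable, which is what makes an inversion attack such as Vec2Text possible. HNSW libraries commonly expose this kind of marking (hnswlib's `mark_deleted`, for example). The sketch does not implement the paper's Epoch Key Rotation.

```python
class ToyVectorIndex:
    """Hypothetical flat index with tombstone-style soft deletion, for illustration only."""

    def __init__(self):
        self.vectors = {}     # id -> embedding; entries survive a soft delete
        self.deleted = set()  # tombstoned ids, skipped at query time

    def add(self, item_id, vector):
        self.vectors[item_id] = list(vector)

    def soft_delete(self, item_id):
        # Only marks the id; the embedding itself is left in memory.
        self.deleted.add(item_id)

    def search(self, query, k=1):
        """Exact top-k by dot product over non-deleted items."""
        scored = [
            (sum(q * v for q, v in zip(query, vec)), item_id)
            for item_id, vec in self.vectors.items()
            if item_id not in self.deleted
        ]
        return [item_id for _, item_id in sorted(scored, reverse=True)[:k]]


index = ToyVectorIndex()
index.add("patient-1", [1.0, 0.0])
index.add("patient-2", [0.0, 1.0])
index.soft_delete("patient-1")
print(index.search([1.0, 0.0], k=2))  # ['patient-2']: deleted id no longer returned
print(index.vectors["patient-1"])     # [1.0, 0.0]: but the embedding is still stored
```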

Crawlee for Python: Build a Web Crawling Pipeline with Robots Handling, Link Graphs, and RAG Chunk Export

The article introduces a comprehensive tutorial on utilizing Crawlee for Python to create a web crawling pipeline that includes robots handling, link graph construction, and RAG chunk export. It details the use of various crawlers such as BeautifulSoupCrawler, ParselCrawler, and PlaywrightCrawler to extract diverse data types from a demo website, including titles, metadata, and JavaScript-rendered content. This resource is significant for practitioners as it provides a practical framework for building efficient web scraping workflows that can be integrated with machine learning models for data processing and analysis.

MarkTechPost · 36 d ago · found 21 d ago · #web crawling #rag #data extraction
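
The RAG chunk-export step can be sketched independently of the crawler. The functions below are hypothetical helpers, not part of Crawlee's API: they split extracted page text into overlapping word windows and attach the page URL and title so retrieved chunks stay citable. Chunk size and overlap are illustrative defaults, and real pipelines often count tokens rather than words.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into windows of chunk_size words; consecutive windows share overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # last window already reaches the end
            break
    return chunks


def to_rag_records(url, title, text, **chunk_kwargs):
    """Attach page-level metadata to each chunk (hypothetical record schema)."""
    return [
        {"id": f"{url}#chunk-{i}", "url": url, "title": title, "text": chunk}
        for i, chunk in enumerate(chunk_words(text, **chunk_kwargs))
    ]


page_text = " ".join(f"w{i}" for i in range(10))
print(chunk_words(page_text, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each record can then be written as one JSON line per chunk for an embedding job; the overlap keeps sentences that straddle a boundary retrievable from either side.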

Liquid AI Introduces LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M: Dense Bi-Encoder and Late-Interaction Models for Fast Multilingual Search Across 11 Languages

Liquid AI has released two models: LFM2.5-Embedding-350M, a dense bi-encoder, and LFM2.5-ColBERT-350M, a late-interaction model, designed for fast multilingual search across 11 languages. These models are optimized for deployment on edge devices, enhancing retrieval efficiency in multilingual contexts. This development is significant for practitioners as it enables improved search capabilities in resource-constrained environments, leveraging advanced architectures for better performance.

MarkTechPost · 38 d ago · found 24 d ago · #multilingual #search #dense-bi-encoder
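
The two model types score relevance differently. A bi-encoder compresses query and document into one vector each and compares them with cosine similarity; a ColBERT-style late-interaction model keeps one vector per token and sums, over query tokens, the best match in the document (MaxSim). A stdlib-only sketch of both scoring rules with toy vectors; it is not Liquid AI's code, and real systems compute these with batched matrix operations.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense_score(query_vec, doc_vec):
    """Bi-encoder: one vector per text, compared by cosine similarity."""
    return dot(normalize(query_vec), normalize(doc_vec))


def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: each query token takes its best-matching doc token; sum over query tokens."""
    q = [normalize(t) for t in query_tokens]
    d = [normalize(t) for t in doc_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q)


print(dense_score([1, 0], [0, 1]))                     # 0.0: orthogonal single vectors
print(maxsim_score([[1, 0], [0, 1]], [[1, 0], [0, 1]]))  # 2.0: every query token matched
```

The trade-off follows from the data layout: late interaction stores many vectors per document, so the index is larger and scoring costs more, in exchange for finer-grained token-level matching.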

Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability

The article introduces a framework for pre-training the Tsetlin Machine (TM) using semantic clusters derived from pre-trained language models like BERT, aiming to enhance interpretability in text classification. By employing K-means or Top2Vec for clustering, the TM learns interpretable semantic keywords without relying on static embeddings, achieving competitive performance with BERT across five datasets while maintaining transparency. This approach is significant for practitioners as it combines the interpretability of TM with the contextual understanding of language models, making it suitable for high-stakes applications.

arXiv cs.CL · 38 d ago · found 22 d ago · #RAG #evidence ordering #inference

VCG: A Multimodal Retrieval Framework for E-Commerce Video Feeds under Extreme Cold-Start Conditions

The paper introduces the Video Candidate Generation (VCG) system, a multimodal retrieval framework for e-commerce video feeds under extreme cold-start conditions. VCG uses a domain-adapted, CLIP-based vision-language model to perform zero-shot retrieval by mapping users and videos into a shared semantic space, addressing the challenges posed by missing interaction history and engagement biases. Evaluation results indicate that VCG improves video completion rates by 50% compared with traditional methods, demonstrating its effectiveness in real-world applications.

arXiv cs.AI38 d agofound 23 d ago#multimodal retrieval#e-commerce#video feeds

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

RTSGameBench is a newly introduced benchmark designed to evaluate strategic reasoning in Vision-Language Models (VLMs) using the Beyond All Reason RTS game. It features a comprehensive evaluation framework that includes diverse gameplay scenarios, mini-games for targeted competency assessment, and a self-evolving generation system for creating new challenges. The benchmark reveals that current state-of-the-art VLMs struggle with coordination and planning in complex, multi-agent environments, highlighting critical areas for improvement in AI models applied to strategic tasks.

arXiv cs.AI38 d agofound 23 d ago#strategic reasoning#vlm#benchmark

Policy-aware Vector Search: A Vision for Fine Grained Access Control in Vector Databases

The article presents a vision for Policy-aware Vector Search, addressing the limitations of fine-grained access control (FGAC) in vector databases, particularly in security-sensitive applications. It formalizes the FGAC policy model for vector databases and discusses the challenges of balancing policy enforcement with approximate nearest neighbor (ANN) search recall and query latency. This work is significant for practitioners as it outlines potential strategies for implementing FGAC in vector databases, which is crucial for secure AI applications.

arXiv cs.AI38 d agofound 23 d ago#vector databases#access control#policy#retrieval-augmented generation
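
The tension between policy enforcement and recall described in the Policy-aware Vector Search entry shows up even in a brute-force toy: filtering an unconstrained top-k by policy (post-filtering) can return fewer than k authorized results, while restricting the candidates first (pre-filtering) cannot. On a real ANN index, enforcing the filter during traversal has its own latency and recall costs. The exhaustive search below stands in for an ANN index, and the document fields and group-based ACLs are hypothetical; this is not the paper's proposed design.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, docs, k):
    """Exhaustive inner-product search; a stand-in for an ANN index."""
    return sorted(docs, key=lambda d: dot(query, d["vec"]), reverse=True)[:k]


def authorized(doc, user_groups):
    """A user may see a document if they share at least one group with its ACL."""
    return not doc["acl"].isdisjoint(user_groups)


def post_filter_search(query, docs, user_groups, k):
    """Search first, then drop hits the user may not see: can return < k results."""
    return [d for d in top_k(query, docs, k) if authorized(d, user_groups)]


def pre_filter_search(query, docs, user_groups, k):
    """Apply the policy to the candidate set first, then search."""
    return top_k(query, [d for d in docs if authorized(d, user_groups)], k)


docs = [
    {"id": "a", "vec": [1.0, 0.0], "acl": {"finance"}},
    {"id": "b", "vec": [0.9, 0.1], "acl": {"finance"}},
    {"id": "c", "vec": [0.5, 0.5], "acl": {"public"}},
    {"id": "d", "vec": [0.1, 0.9], "acl": {"public"}},
]
query = [1.0, 0.0]
print([d["id"] for d in post_filter_search(query, docs, {"public"}, k=2)])  # []
print([d["id"] for d in pre_filter_search(query, docs, {"public"}, k=2)])   # ['c', 'd']
```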

AI Economist Agent: An Agentic Framework for Model-Grounded Economic Analysis with RAG, Knowledge Graphs, and Large Language Models

The article presents an AI economist agent framework that combines Retrieval-Augmented Generation (RAG) with knowledge graphs and large language models (LLMs) for economic scenario analysis. The framework grounds its generated narratives in economic data and theory, making them more coherent and traceable, as demonstrated on U.S. inflation persistence and bank stress-test narratives. For practitioners, this supports more rigorous, data-grounded economic analysis and more reliable LLM-derived insights in economic contexts.

arXiv cs.AI38 d agofound 23 d ago#rag#knowledge graphs#llm

ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval

The paper introduces ELVA, a novel rule-based reinforcement learning framework designed to address grain blindness in Universal Multimodal Retrieval (UMR) by treating negative samples based on their similarity to positive samples. It extends Reinforcement Learning with Verifiable Rewards (RLVR) for retrieval tasks and introduces MRBench, a benchmark for evaluating multi-grain queries. ELVA achieves state-of-the-art performance on standard retrieval benchmarks, with a 13.1% improvement on MRBench, highlighting its significance for practitioners focused on enhancing retrieval models with nuanced query handling.

arXiv cs.AI38 d agofound 23 d ago#retrieval#contrastive-learning#multimodal

QueryGaussian: Scalable and Training-Free Open-Vocabulary 3D Instance Retrieval

QueryGaussian is a novel, training-free framework designed for scalable open-vocabulary 3D instance retrieval, addressing the limitations of existing scene-level embedding methods that struggle with memory and computational costs in complex environments. It utilizes a unique instance-level query mechanism, leveraging pre-trained 2D vision models and a maximum-weight association strategy to enhance semantic-visual consistency while incorporating a temporal fusion module for improved projection accuracy. Experimental results indicate that QueryGaussian achieves comparable accuracy to state-of-the-art approaches while reducing GPU memory usage by over 70% and accelerating inference by 180x, making it feasible for efficient instance retrieval in large-scale urban scenes on consumer-grade hardware.

arXiv cs.AI38 d agofound 23 d ago#3D retrieval#instance retrieval#open-vocabulary

Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

OmniSONAR is a new family of omnilingual sentence embedding models capable of integrating text, speech, code, and mathematical expressions into a unified semantic space, achieving state-of-the-art performance across thousands of languages. Utilizing progressive training and a two-stage teacher-student distillation framework, it halves cross-lingual similarity search error on the 200-language FLORES dataset and significantly outperforms NLLB-3B in translation tasks. This model is particularly relevant for practitioners as it facilitates high-quality cross-lingual and cross-modal applications, enabling effective multilingual processing and reducing the need for extensive language-specific resources.

arXiv cs.CL38 d agofound 22 d ago#sentence-embeddings#cross-lingual#omnilingual

Multi-Agent Transactive Memory

The article introduces Multi-Agent Transactive Memory (MATM), a framework designed for the population-level storage and retrieval of agent-generated trajectories to enhance knowledge sharing among diverse LLM agents. By allowing producer agents to contribute their trajectories to a shared repository, MATM enables consumer agents to retrieve these artifacts, improving task execution in interactive environments like ALFWorld and WebArena. The experimental results indicate that using MATM significantly enhances downstream task performance and reduces interaction steps, highlighting its potential as a design pattern for experience sharing in decentralized agent ecosystems.

arXiv cs.AI38 d agofound 24 d ago#agents#memory#knowledge

Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why

The article presents the Agentic Clinical Information Extraction (ACIE) system, an on-premise retrieval-augmented generation (RAG) pipeline designed to handle complex patient contexts by integrating complete document-level metadata for enhanced extraction accuracy. It addresses shortcomings in standard RAG approaches, particularly in temporal reasoning and cross-document dependencies. In a study involving 7,326 clinician evaluations, ACIE achieved a 96.5% acceptance rate for extracted information, demonstrating its potential for reliable clinical decision support and verification in medical contexts.

arXiv cs.AI38 d agofound 24 d ago#rag#clinical#agents

CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

CacheWeaver is a new method that speeds up Retrieval-Augmented Generation (RAG) inference through cache-aware evidence ordering. It uses a prefix tree to order the retrieved evidence so that the most reusable prefixes come first, reducing median time-to-first-token (TTFT) by 20-33% across various vLLM configurations while maintaining answer quality. For practitioners, it is a lightweight way to improve inference speed without modifying the serving engine or changing the evidence sets.

arXiv cs.CL38 d agofound 22 d ago#RAG#cache#inference
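
One way to picture CacheWeaver-style ordering: keep a prefix tree of evidence sequences already sent to the server, and for a new request greedily follow the most frequently reused cached path made of the retrieved passages, then append the rest in retrieval order. Only the order changes, never the evidence set. This is a hypothetical sketch (the trie over evidence IDs, the hit counts, and the greedy rule are assumptions), not CacheWeaver's actual algorithm or its vLLM integration.

```python
class PrefixTrie:
    """Trie over sequences of evidence IDs, counting how often each prefix was served."""

    def __init__(self):
        self.children = {}
        self.count = 0

    def insert(self, sequence):
        node = self
        for item in sequence:
            node = node.children.setdefault(item, PrefixTrie())
            node.count += 1


def cache_aware_order(trie, evidence):
    """Reorder `evidence` (listed in retrieval rank) so it starts with the most-reused
    cached prefix; ties go to the better-ranked passage."""
    remaining = list(evidence)
    ordered, node = [], trie
    while True:
        options = [item for item in node.children if item in remaining]
        if not options:
            break
        best = max(options, key=lambda item: (node.children[item].count, -remaining.index(item)))
        ordered.append(best)
        remaining.remove(best)
        node = node.children[best]
    return ordered + remaining


cache = PrefixTrie()
for served in (["e3", "e1", "e7"], ["e3", "e1"], ["e5"]):
    cache.insert(served)

print(cache_aware_order(cache, ["e1", "e5", "e3", "e9"]))  # ['e3', 'e1', 'e5', 'e9']
```

Putting frequently shared passages at the front is what lets a prefix-caching server reuse their KV cache across requests; the evidence that follows the cached prefix still has to be prefilled as usual.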

Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis

The article presents a novel Object-Centric OOD detection framework that leverages Object CO-occurrence (OCO) patterns to improve the detection of out-of-distribution (OOD) samples, particularly near-OOD instances. The method predicts disentangled representations for test samples and categorizes them based on observed co-occurrence patterns in the training data, allowing for a more nuanced detection process that incorporates semantic relationships. Experimental evaluations show that OCO achieves competitive results in various OOD scenarios, addressing both semantic and covariate shifts, with the code available at https://github.com/Michael-McQueen/OCO, making it a valuable tool for practitioners in enhancing model reliability.

arXiv cs.AI38 d agofound 22 d ago#ood-detection#object-co-occurrence

When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

The paper introduces Streaming Retrieval-Augmented Generation (Streaming RAG), in which tool queries are issued speculatively while the user is still typing, reducing perceived latency. It characterizes tool-intent stabilization, the point at which speculative queries converge on the relevant results, and derives a model-agnostic bound on tool-latency savings as a function of the user's input rate. Under favorable conditions (600 ms latency, 3 words/s input), latency can be substantially hidden for 73.9% of queries, which gives practitioners concrete guidance on when to fire tool calls in real-time applications.

arXiv cs.CL38 d agofound 22 d ago#streaming#tool use#llm
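
A back-of-the-envelope model shows why input rate matters for Streaming RAG. Assume, purely for illustration, that the speculative query fires as soon as tool intent stabilizes at word `stable_at` and the user keeps typing at a constant rate until word `n_words`; the remaining typing time is then a budget that can absorb tool latency. This simplified model and its function names are hypothetical, not the paper's bound.

```python
import math


def latency_split(n_words, stable_at, words_per_sec, tool_latency_s):
    """Return (hidden, exposed) seconds of tool latency under the simple model:
    hidden = min(latency, remaining typing time)."""
    remaining_typing_s = (n_words - stable_at) / words_per_sec
    hidden = min(tool_latency_s, remaining_typing_s)
    return hidden, tool_latency_s - hidden


def words_needed_to_hide(words_per_sec, tool_latency_s):
    """Minimum number of words still to be typed after stabilization to hide all latency."""
    return math.ceil(tool_latency_s * words_per_sec)


print(words_needed_to_hide(3, 0.6))   # 2 (0.6 s x 3 words/s = 1.8 words)
print(latency_split(12, 10, 3, 0.6))  # (0.6, 0.0): fully hidden
print(latency_split(12, 11, 3, 0.6))  # about 0.33 s hidden, 0.27 s exposed
```

Under this toy model, at 600 ms and 3 words/s the intent only needs to stabilize about two words before the user finishes for the tool call to be invisible.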

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

The paper introduces DICE (Document Inference via Chunk Evidence), a novel approach for long-document retrieval that addresses failures in dense retrieval caused by document-side early compression. By independently encoding document chunks and aggregating them into a single vector, DICE significantly improves retrieval performance on long documents, achieving notable gains on benchmarks like Dream and Needle for slices exceeding 4k tokens. This method demonstrates that enhancing document-level encoding can effectively mitigate evidence dilution, making it a valuable strategy for practitioners working with long-form content in retrieval systems.

arXiv cs.CL39 d agofound 25 d ago#retrieval#long-documents#evidence
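
The chunk-then-aggregate idea behind DICE can be sketched as follows: encode each chunk separately so no part of a long document is truncated or squeezed through a single encoder pass, then pool the chunk vectors into one document vector. The hashed bag-of-words `toy_encode` is a hypothetical stand-in for a real dense encoder, and plain mean pooling is an assumption; DICE's actual aggregation may differ.

```python
import math
import zlib


def toy_encode(text, dim=16):
    """Hypothetical stand-in for a dense encoder: hashed bag of words, L2-normalized.
    zlib.crc32 keeps the hashing deterministic across runs."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def chunk_tokens(text, size):
    """Non-overlapping word windows; a real system would chunk by model tokens."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def chunk_aggregate_embedding(text, chunk_size=128, dim=16):
    """Encode each chunk independently, mean-pool the chunk vectors, re-normalize."""
    chunk_vecs = [toy_encode(c, dim) for c in chunk_tokens(text, chunk_size)]
    if not chunk_vecs:
        return [0.0] * dim
    pooled = [sum(col) / len(chunk_vecs) for col in zip(*chunk_vecs)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


doc = "intro " * 300 + "needle fact"  # 302 words -> chunks of 128, 128, and 46 words
vec = chunk_aggregate_embedding(doc, chunk_size=128)
print(len(vec), round(sum(x * x for x in vec), 6))  # 16 1.0
```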

SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG

SproutRAG is a novel attention-guided hierarchical retrieval-augmented generation (RAG) framework designed to enhance the retrieval granularity and contextual coherence of long documents. It organizes sentence-level chunks into progressively larger semantic units using learned inter-sentence attention, allowing for multi-granularity retrieval without incurring additional LLM calls or lossy summarization. Experimental results show a 6.1% improvement in information efficiency across various benchmarks, making it a significant advancement for practitioners seeking to optimize retrieval processes in complex document scenarios.

arXiv cs.CL39 d agofound 25 d ago#rag#long-document#attention

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

MCompassRAG is a new retrieval framework that integrates topic metadata into the retrieval step of retrieval-augmented generation (RAG) systems. A lightweight retriever, trained via LLM-teacher distillation, performs topic-aware retrieval without additional LLM calls; across six complex benchmarks it improves information efficiency by 8.24% on average with more than 5× lower latency than existing RAG baselines. For practitioners, it balances precision and efficiency in document retrieval for deep research tasks, potentially reducing operational costs while improving the quality of retrieved evidence.

arXiv cs.CL39 d agofound 25 d ago#rag#retrieval#metadata

AI search grounded in Facebook posts? What could go wrong?

Meta has introduced an AI Mode for search that recommends activities based on Facebook posts. However, its usefulness is in doubt because it currently misreads context and relevance. For practitioners, this highlights the difficulty of building context-aware recommendation systems on social media data and the need for robust training methods and careful data curation.

The Verge — AI39 d agofound 29 d ago#ai#search#meta

Temporal Preference Optimization for Unsupervised Retrieval

The article introduces TPOUR (Temporal Preference Optimization for Unsupervised Retriever), a novel approach for enhancing unsupervised dense retrieval systems by addressing temporal relevance through a method called Temporal Retrieval Preference Optimization (TRPO). TPOUR demonstrates superior performance over both unsupervised and supervised baselines in temporal information retrieval tasks, achieving a 12.15% improvement in nDCG@5 compared to Qwen-Embedding-8B, despite being 72.7x smaller. This development is significant for practitioners as it enables more accurate retrieval of temporally aligned documents without the need for supervised training with explicit timestamps, thus enhancing the capability of AI systems in handling time-sensitive queries.

arXiv cs.AI40 d agofound 28 d ago#retrieval#unsupervised#temporal
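
Since the TPOUR result is reported in nDCG@5, here is how that metric is computed. This sketch uses linear gains (relevance divided by log2(rank + 1), ranks starting at 1); some evaluations use 2^rel - 1 instead, and which variant the paper uses is not stated in the summary. Pass the query's full list of judged relevances as `ideal_relevances` to normalize against the true ideal ranking; otherwise the retrieved list itself is used.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gains; 0-based position i is discounted by log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k, ideal_relevances=None):
    """DCG of the ranking divided by the DCG of the best possible ordering."""
    ideal_source = relevances if ideal_relevances is None else ideal_relevances
    ideal = dcg_at_k(sorted(ideal_source, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


ranking = [0, 1, 0, 0, 1]  # relevance of the top-5 retrieved documents
print(round(ndcg_at_k(ranking, 5), 4))  # 0.6241
print(ndcg_at_k([1, 1, 0, 0, 0], 5))    # 1.0
```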

A Unified Framework for Context-Aware and Relation-Aware Graph Retrieval-Augmented Generation

The paper introduces HyGRAG, a hierarchical graph framework for Retrieval-Augmented Generation (RAG) that integrates contextual and relational information from external knowledge sources. It employs hybrid graphs with chunk and entity nodes, allowing for dynamic updates and efficient retrieval across multiple abstraction levels. Experimental results indicate a 9.7% improvement in multi-hop reasoning accuracy, highlighting its potential for enhancing the performance of large language models in knowledge-intensive applications.

arXiv cs.AI40 d agofound 29 d ago#context-aware#relation-aware#retrieval-augmented

IUU+DB: Tracking Illegal, Unreported, and Unregulated Fishing, Seafood Fraud, and Labor Abuse through LLM-driven Information Extraction

IUU+DB is a large language model-driven system designed to create a global incident database for tracking illegal, unreported, and unregulated (IUU+) fishing activities and related crimes. It utilizes information extraction techniques to classify documents, extract critical data elements such as actors and locations, and enables deduplication and trend analysis. This tool is significant for practitioners as it enhances the ability to analyze fragmented data, identify hotspots, and support risk assessments and policy enforcement in the fisheries sector.

arXiv cs.AI40 d agofound 28 d ago#information-extraction#fishing#llm

Prototype-Based Semantic Consistency Alignment for Domain Adaptive Retrieval

The article introduces Prototype-Based Semantic Consistency Alignment (PSCA), a two-stage framework designed for domain adaptive retrieval that addresses limitations in existing methods. PSCA enhances class-level semantic alignment by utilizing orthogonal prototypes to maximize inter-class separability and improve the reliability of pseudo-labels through geometric proximity. This approach allows for better feature reconstruction and quantization, resulting in improved hash coding quality and superior performance across multiple datasets, which is crucial for practitioners aiming to enhance retrieval systems in domain-shift scenarios.

arXiv cs.AI40 d agofound 28 d ago#retrieval#domain adaptation

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

The article introduces MODE-RAG, a multi-agent system designed to address hallucinations and logical fabrications in Multimodal Retrieval-Augmented Generation (M-RAG) models. It uses Variational Free Energy (VFE) and internal attention states to manage interventions dynamically, applying Monte Carlo Tree Search (MCTS) for causal reasoning and logit perturbations to mitigate sycophancy. Evaluated on the ModeVent benchmark, the system substantially reduces hallucination rates, making M-RAG deployments more reliable for practitioners.

arXiv cs.AI40 d agofound 28 d ago#retrieval-augmented generation#hallucinations#multi-agent

Non-negative Elastic Net Decoding for Information Retrieval

The paper presents Non-Negative Elastic Net (NNN) decoding as an innovative approach to information retrieval, addressing the limitations of traditional dense retrieval methods that often yield redundant results. NNN decoding treats document selection as a joint decoding problem, allowing for a more diverse set of retrieved documents by using a non-negative linear combination of embeddings. Experimental results demonstrate that NNN decoding significantly outperforms conventional dense retrieval across various benchmarks, highlighting its potential for enhancing retrieval systems by optimizing embeddings beyond mere inner-product scoring.

arXiv cs.AI40 d agofound 28 d ago#information-retrieval#decoding
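
A rough sketch of the non-negative elastic net idea: treat retrieval as reconstructing the query embedding from a sparse, non-negative combination of document embeddings, and rank documents by their weights. It is solved here with plain cyclic coordinate descent; the penalty values, the toy 2-d embeddings, and the function name are illustrative assumptions, and the paper's exact objective and solver may differ.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nn_elastic_net_decode(query, docs, l1=0.05, l2=0.1, iters=200):
    """Minimize 0.5*||q - sum_j w_j d_j||^2 + l1*sum_j w_j + 0.5*l2*sum_j w_j^2
    subject to w_j >= 0, by cyclic coordinate descent. Returns one weight per doc."""
    w = [0.0] * len(docs)
    residual = list(query)  # q - sum_j w_j d_j, with all weights starting at 0
    sq_norms = [dot(d, d) for d in docs]
    for _ in range(iters):
        for j, d in enumerate(docs):
            # Correlation of doc j with the residual that excludes doc j's own contribution.
            rho = dot(d, residual) + w[j] * sq_norms[j]
            new_w = max(0.0, (rho - l1) / (sq_norms[j] + l2))  # soft-threshold, clip at 0
            delta = new_w - w[j]
            if delta:
                residual = [r - delta * x for r, x in zip(residual, d)]
                w[j] = new_w
    return w


docs = {"a": [1.0, 0.0], "a_dup": [0.98, 0.02], "b": [0.0, 1.0]}
query = [1.0, 0.8]  # mostly about aspect 1, partly about aspect 2
names, vecs = list(docs), list(docs.values())

dense_top2 = sorted(names, key=lambda n: dot(query, docs[n]), reverse=True)[:2]
weights = dict(zip(names, nn_elastic_net_decode(query, vecs)))
decoded_top2 = sorted(names, key=lambda n: weights[n], reverse=True)[:2]
print(dense_top2)    # ['a', 'a_dup']: two near-duplicates
print(decoded_top2)  # includes 'b', which covers the second aspect
```

Plain inner-product ranking returns the two near-duplicates. In the joint decoding, the near-duplicates share the weight needed to explain the first aspect, so the distinct document `b` rises into the top two, which is the kind of diversity effect the summary describes.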

FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow

FlowRAG is a newly proposed framework that enhances graph-based retrieval-augmented generation (GraphRAG) by integrating a quad-level heterogeneous graph structure, which includes nodes for passages, summaries, sentences, and entities. It features a dual-granularity activation module for improved semantic recall and a frequency-aware weighted flow module that optimizes relevance routing through entity-passage links, yielding robust multi-hop reasoning capabilities. This approach demonstrates state-of-the-art performance on complex reasoning benchmarks, making it a significant advancement for practitioners focusing on knowledge-intensive AI tasks.

arXiv cs.AI40 d agofound 29 d ago#graph#retrieval-augmented#reasoning

R1-SyntheticVL: Is Synthetic Data from Generative Models Ready for Multimodal Large Language Model?

The article introduces R1-SyntheticVL, a multimodal large language model trained on MMSynthetic-20K, utilizing a novel data synthesis technique called Collective Adversarial Data Synthesis (CADS). CADS comprises two phases: CAD-Generate for generating diverse multimodal data and CAD-Judge for assessing its quality, incorporating an Adversarial Context Optimization mechanism to enhance data challenge and value. This approach is significant for practitioners as it provides a framework for generating high-quality training data, potentially improving the performance of MLLMs on complex tasks.

arXiv cs.AI40 d agofound 28 d ago#data synthesis#multimodal#training

Google Cloud Introduces Open Knowledge Format (OKF): A Vendor-Neutral Markdown Spec for Giving AI Agents Curated Context

Google Cloud has introduced the Open Knowledge Format (OKF), a vendor-neutral specification designed for structuring AI agent context using markdown files with YAML frontmatter. This format allows for the efficient organization of concepts with minimal required metadata, adhering to three key design principles. The release includes reference tools and a Python consumer, which enable practitioners to implement OKF for enhanced context management in AI applications, differentiating it from traditional Retrieval-Augmented Generation (RAG) methods.

MarkTechPost41 d agofound 33 d ago#open-knowledge-format#ai#context

RL-Index: Reinforcement Learning for Retrieval Index Reasoning

The paper introduces RL-Index, a reinforcement learning framework designed to enhance retrieval index reasoning by shifting the reasoning process from query time to the indexing stage. It utilizes Group Relative Policy Optimization (GRPO) to optimize LLM-generated rationales that improve the relationship between queries and knowledge, leading to better retrieval effectiveness and reduced latency, as demonstrated by experiments on the BRIGHT benchmark. This approach offers a robust, generalizable indexing strategy that can be integrated into various retrieval systems, making it valuable for practitioners looking to enhance retrieval performance in AI applications.

arXiv cs.AI41 d agofound 32 d ago#retrieval#reinforcement learning#index reasoning
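
The group-relative step at the heart of GRPO can be shown in a few lines: sample a group of candidates for the same input, score each, and use each reward's z-score within its group as the advantage, so no learned value function is needed. In the RL-Index setting the group would be several LLM-generated rationales for one indexed document; the reward values below are hypothetical, and the full GRPO objective also includes a clipped policy ratio and usually a KL term, which are omitted here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four rationales sampled for the same document.
rewards = [0.2, 0.8, 0.5, 0.5]
print([round(a, 3) for a in group_relative_advantages(rewards)])  # [-1.414, 1.414, 0.0, 0.0]
```

Rationales that beat their siblings get positive advantages and are reinforced; if every rationale scores the same, all advantages are zero and the group contributes no learning signal.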

MMLongEmbed: Benchmarking Multimodal Embedding Models in Long-Context Scenarios

MMLongEmbed is introduced as the first comprehensive benchmark for evaluating Multimodal Embedding Models (MEMs) in long-context scenarios, comprising four retrieval tasks across text, document, and video modalities. The evaluation reveals that state-of-the-art models often rely on superficial feature matching, failing to effectively capture deep semantic and structural dependencies, with performance degradation linked to context length and information placement. This benchmark provides valuable insights for practitioners, highlighting the limitations of current architectures and the need for improved strategies in handling long-context multimodal inputs.

arXiv cs.AI41 d agofound 33 d ago#benchmark#multimodal#embedding

RASST: Retrieval-Augmented Simultaneous Speech Translation

The article introduces Retrieval-Augmented Simultaneous Speech Translation (RASST), a model designed to enhance simultaneous speech translation by integrating a lightweight speech-text retriever for accurate cross-modal retrieval of terminology under partial speech input. RASST improves terminology accuracy by nearly 40% and overall translation quality by up to 3 BLEU points while maintaining low computational overhead. This advancement is significant for practitioners as it addresses the challenges of rare and domain-specific terminology in real-time translation scenarios.

arXiv cs.CL41 d agofound 29 d ago#speech-translation#retrieval-augmented#llm

Semantics-Enhanced Retrieval-Augmented Time Series Forecasting

The article introduces SERAF (Semantics-Enhanced Retrieval-Augmented Time Series Forecasting), a novel framework that enhances time series forecasting by incorporating dual retrieval of historical time series segments and their self-generated textual descriptions. This multimodal approach addresses limitations of traditional similarity-based retrieval methods, particularly in non-stationary contexts, and has been validated through experiments on seven real-world datasets, showing improved performance over existing state-of-the-art models. SERAF's integration of numerical and semantic data could significantly benefit practitioners by providing more robust forecasting capabilities in dynamic environments.

arXiv cs.AI41 d agofound 33 d ago#time series#forecasting#rag

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

The article studies a "Lost at the End" phenomenon in multimodal knowledge-based visual question answering (KB-VQA): instead of the familiar U-shaped positional pattern, readers show a primacy bias in which earlier retrieved passages are used far more effectively than later ones. Across three open-source vision-language models (VLMs) with 7B/8B parameters and two KB-VQA benchmarks, placing the gold passage first yields a 16 to 26 point advantage. The authors argue that recall@k is an inadequate measure for deployed KB-VQA systems, since it ignores where within the top k the gold passage lands, and introduce a gold-position protocol for evaluating reader-side interventions that mitigate the bias, which can inform model design and retrieval strategies.

arXiv cs.AI41 d agofound 31 d ago#retrieval#question#answering
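
The gold-position idea from the "Lost at the End" entry can be sketched as a simple evaluation loop: hold the retrieved set fixed, move the gold passage through every slot, and record reader accuracy per slot. Recall@k is identical for all of these contexts because the same passages are present, which is why it cannot detect a primacy bias. The `reader` callable, the example format, and the stub reader (which simply returns its first passage as the answer) are hypothetical; the paper's protocol details may differ.

```python
def gold_position_accuracy(examples, reader, n_passages):
    """Accuracy of `reader(question, passages)` with the gold passage placed at each slot.
    Distractors keep their relative order; only the gold passage moves."""
    hits = [0] * n_passages
    for ex in examples:
        distractors = ex["distractors"][: n_passages - 1]
        for pos in range(n_passages):
            context = distractors[:pos] + [ex["gold"]] + distractors[pos:]
            if reader(ex["question"], context) == ex["answer"]:
                hits[pos] += 1
    return [h / len(examples) for h in hits]


def first_passage_reader(question, passages):
    """Stub reader with an extreme primacy bias: it only ever uses the first passage."""
    return passages[0]


examples = [
    {"question": "q1", "gold": "ans-1", "answer": "ans-1", "distractors": ["x1", "y1"]},
    {"question": "q2", "gold": "ans-2", "answer": "ans-2", "distractors": ["x2", "y2"]},
]
print(gold_position_accuracy(examples, first_passage_reader, n_passages=3))  # [1.0, 0.0, 0.0]
```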

SPI: Query-Depth-Adaptive Indexing for Streaming RAG in Vector Databases

The article introduces Semantic Pyramid Indexing (SPI), a novel indexing framework for vector databases designed to enhance retrieval-augmented generation (RAG) by allowing for incremental updates and adaptive retrieval depth based on query complexity. SPI organizes embeddings into multiple semantic resolution levels and utilizes an uncertainty-aware controller for efficient query processing, achieving a 1.4–2.3× reduction in average retrieval latency with competitive Recall@10 on datasets like MS MARCO and Natural Questions. This framework supports progressive coarse-to-fine approximate nearest neighbor search and integrates seamlessly with existing backends such as FAISS and Qdrant, making it a valuable tool for practitioners aiming to optimize query performance in dynamic environments.

arXiv cs.AI41 d agofound 31 d ago#rag#vector-database#query-processing

When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs

The paper introduces MAD-RAG, a novel intervention designed to address Attention Distraction (AD) in Retrieval-Augmented Generation (RAG) for Large Vision-Language Models (LVLMs). MAD-RAG decouples visual grounding from context integration using a dual-question formulation and attention mixing, leading to significant performance improvements on knowledge-based visual question answering tasks, with gains of up to 9.20% over existing baselines on datasets like OK-VQA and E-VQA. This approach is crucial for practitioners as it enhances model reliability by mitigating attention-related failures without incurring substantial computational costs.

arXiv cs.AI41 d agofound 31 d ago#attention#retrieval-augmented#lvms

LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization

The article presents LM-SPT, a speech tokenization method that uses semantic speech-resynthesis distillation to align speech tokens more closely with language models (LMs). Unlike traditional approaches that rely on self-supervised learning teachers and pooling, LM-SPT resynthesizes speech from its semantic tokens, which allows lower frame rates and better semantic alignment without sacrificing reconstruction fidelity. Experiments show that LM-SPT outperforms existing semantic-enhanced tokenizers on tasks such as automatic speech recognition and text-to-speech, which matters for practitioners seeking to integrate speech language models (SLMs) more effectively with text LMs.

arXiv cs.AI41 d agofound 31 d ago#speech-tokenization#semantic-distillation#language-models

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

The article introduces SAGA, a framework that enhances vision encoders for retrieval by leveraging frozen multimodal large language models (MLLMs) to provide attribute-aware training signals. By utilizing Group Relative Policy Optimization (GRPO), SAGA replaces traditional scalar supervision with gradients that focus on specific visual attributes, resulting in improved embedding performance. The framework demonstrates a 3 to 6 point increase in Recall@1 across several benchmark datasets, making it a significant advancement for practitioners in zero-shot image retrieval tasks.

arXiv cs.AI41 d agofound 32 d ago#visual embeddings#attribute gradients#mllm

MAGE-RAG: Multigranular Adaptive Graph Evidence for Agentic Multimodal RAG in Long-Document QA

MAGE-RAG introduces a multigranular adaptive graph evidence framework designed for long-document multimodal question answering, addressing the limitations of existing retrieval-augmented generation (RAG) methods. The framework constructs an evidence graph that captures various relationships among text, images, and layout elements, allowing for dynamic evidence selection during query processing. Achieving 52.75% accuracy on LongDocURL and 53.26% on MMLongBench-Doc, MAGE-RAG's approach enhances evidence relevance while managing context noise, which is critical for practitioners developing efficient long-document QA systems.

arXiv cs.AI41 d agofound 32 d ago#rag#qa#multimodal#evidence

Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time

The paper introduces a retrieval-augmented vision-language-action (VLA) policy that allows for the adaptation to new tasks at test time without the need for per-task fine-tuning. By training a policy on paired demonstrations and utilizing a retrieval mechanism to access additional task-specific demonstrations, the method enables efficient cross-embodiment generalization, particularly enhancing performance in the Cosmos Policy framework. This approach reduces the computational burden associated with adapting models to new tasks, making it significant for practitioners looking to implement scalable and flexible AI systems in robotics and related fields.

arXiv cs.AI41 d agofound 32 d ago#retrieval#vision-language#policy

Domain-Guided Prompting of the Segment Anything Model for Seismic Interpretation: The Role of Attributes, Visualization, and Hybrid Prompts

The study introduces a framework for zero-shot adaptation of the Segment Anything Model (SAM) for seismic interpretation, which enhances segmentation accuracy without the need for extensive fine-tuning. Key components include aligning seismic attributes and visualization techniques with geological targets, and a hybrid prompting strategy that combines user-defined point prompts with dense mask prompts from SAM's internal feature activations. This approach allows practitioners to utilize SAM effectively across various geological contexts while minimizing the dependency on labeled data, thus streamlining the application of large pretrained models in seismic analysis.

arXiv cs.AI41 d agofound 32 d ago#seismic-interpretation#prompting#foundation-models

Combining Retrieval-Augmented Text Generation with LLMs for Reading Content Recommendations

The article presents a system that integrates Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs) for personalized reading content recommendations. The architecture features four modules and utilizes three LLMs: Meta LLaMA 4 Scout, LLaMA 3.1 8B Instant, and Google Gemma2 9B, with prompting strategies including Chain-of-Thought, zero-shot, and few-shot. Experimental results indicate that RAG enhances performance, improving relevance and groundedness by 26-35 percentage points, which is significant for practitioners seeking to develop more accurate and contextually relevant LLM applications.

arXiv cs.AI41 d agofound 32 d ago#RAG#LLM#content-recommendation

FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models

FusionRS is introduced as the first large-scale RGB-infrared-text dataset aimed at enhancing dual-modal vision-language learning in remote sensing. It includes aligned RGB and infrared image pairs, each paired with conventional and infrared-specific captions, allowing for improved RGB-IR alignment and performance in tasks such as infrared-to-text retrieval and dual-modal captioning. This dataset facilitates the training of CLIP-style models and generative vision-language models, emphasizing the necessity of modality-specific textual supervision for effective RGB-infrared representation learning, which is crucial for practitioners developing advanced remote sensing applications.

arXiv cs.AI41 d agofound 31 d ago#remote sensing#vision-language#dataset

Projection and Quantisation: A Unifying View of Learning to Hash, from Random Projections to the RAG Era

The paper presents a unified framework for approximate nearest-neighbor search, focusing on the interplay of projection, quantization, and organization in retrieval methods. It introduces the BitBudget benchmark for reproducible measurement, highlighting that a one-bit code with full-precision re-ranking can match the quality of uncompressed methods while significantly reducing memory usage. The findings emphasize the effectiveness of learned embeddings and supervised codes in improving retrieval quality, which is crucial for practitioners optimizing performance in large-scale retrieval and retrieval-augmented generation systems.

arXiv cs.AI41 d agofound 31 d ago#retrieval-augmented-generation#hashing#approximate-nearest-neighbour
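
The one-bit-plus-re-ranking recipe from the learning-to-hash entry can be sketched in a few lines: binarize each embedding by sign, shortlist candidates by Hamming distance, then re-rank the shortlist with full-precision inner products. At 1 bit instead of 32 per dimension, the codes are 32× smaller than float32 vectors, though the full vectors must still be stored somewhere for re-ranking. The shortlist size and toy data are assumptions; this is not the BitBudget benchmark code.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sign_bits(vec):
    """One bit per dimension: 1 if the component is non-negative, else 0."""
    return tuple(1 if x >= 0 else 0 for x in vec)


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def search_1bit_rerank(query, corpus, k=2, shortlist=4):
    q_bits = sign_bits(query)
    codes = [sign_bits(v) for v in corpus]  # precomputed (and bit-packed) in a real index
    candidates = sorted(range(len(corpus)), key=lambda i: hamming(q_bits, codes[i]))[:shortlist]
    return sorted(candidates, key=lambda i: dot(query, corpus[i]), reverse=True)[:k]


def exact_top_k(query, corpus, k=2):
    return sorted(range(len(corpus)), key=lambda i: dot(query, corpus[i]), reverse=True)[:k]


corpus = [
    [0.9, 0.1, -0.2, 0.3],
    [0.8, 0.3, -0.1, 0.2],
    [-0.7, 0.2, 0.5, -0.1],
    [0.1, -0.9, 0.4, 0.6],
    [0.85, 0.05, -0.3, 0.4],
]
query = [1.0, 0.2, -0.2, 0.3]
print(search_1bit_rerank(query, corpus), exact_top_k(query, corpus))  # [0, 4] [0, 4]
```

Hamming distance alone cannot separate items 0, 1, and 4 (they share the query's bit pattern); the full-precision re-ranking step is what recovers the exact order among them.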

Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA

The article introduces SearchFireSafety, a benchmark for evaluating structure-aware retrieval and safety in statute-centric legal question answering (QA). Built around fire-safety regulations, it tests whether models can retrieve hierarchically linked evidence and manage hallucination risk when the context is incomplete. Experiments show that graph-guided retrieval improves performance but also raises hallucination rates in domain-adapted models that lack essential statutory context, which underscores the need for benchmarks that measure retrieval effectiveness and model safety together in legal AI.

arXiv cs.AI · 41 d ago · found 30 d ago · #legal_qa #statute #retrieval
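Retrieving "hierarchically linked evidence" means that a matched provision often needs its parent sections and cross-referenced provisions before it can answer a question. A minimal sketch of that expansion step over a toy statute tree follows; the section IDs, the `parent` and `refs` fields, and the expansion rule are hypothetical and are not SearchFireSafety's schema.

```python
def expand_evidence(hit_ids, sections, ref_hops=1):
    """Return retrieved sections plus their ancestors and cross-referenced sections.

    Assumption: `sections` maps a section ID to {"parent": id_or_None, "refs": [ids]}.
    """
    evidence, seen = [], set()
    frontier = list(hit_ids)
    for hop in range(ref_hops + 1):
        next_frontier = []
        for sid in frontier:
            node = sid
            # walk up the hierarchy, e.g. clause -> section -> chapter
            while node is not None and node not in seen:
                seen.add(node)
                evidence.append(node)
                node = sections[node].get("parent")
            if hop < ref_hops:
                # follow explicit cross-references one level further
                next_frontier.extend(sections[sid].get("refs", []))
        frontier = next_frontier
    return evidence
```

A retriever that returns only the matched clause would miss the definitions and exceptions that live in its parents or referenced sections, which is the kind of incomplete context the benchmark probes.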

SAG: SQL-Retrieval Augmented Generation with Query-Time Dynamic Hyperedges

The paper introduces SAG (SQL-Retrieval Augmented Generation), an architecture that uses SQL join queries to link events into local hyperedges at query time, removing the need for a global static graph. On multi-hop reasoning benchmarks, SAG reaches 80.0% Recall@5 on MuSiQue and outperforms existing methods on 8 of 9 Recall@K metrics across HotpotQA and 2WikiMultiHop. The approach supports incremental updates and efficient scaling, making it a practical retrieval option for large-scale AI applications.

arXiv cs.CL · 41 d ago · found 30 d ago · #retrieval-augmented #SQL #dynamic
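The central SAG idea, linking events at query time with a SQL join instead of maintaining a precomputed graph, can be sketched with Python's built-in `sqlite3`. The two-table schema (events and their entity mentions) and the two-hop join are illustrative assumptions; the paper's actual schema and hyperedge construction may differ.

```python
import sqlite3


def build_db(events):
    """events: list of (event_id, text, [entities]). Returns an in-memory database."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, text TEXT)")
    con.execute("CREATE TABLE mentions (event_id INTEGER, entity TEXT)")
    for eid, text, ents in events:
        con.execute("INSERT INTO events VALUES (?, ?)", (eid, text))
        con.executemany("INSERT INTO mentions VALUES (?, ?)", [(eid, e) for e in ents])
    return con


def hyperedge(con, seed_entity):
    """Events mentioning the seed entity plus events sharing any entity with them (two hops)."""
    return con.execute(
        """
        SELECT DISTINCT e.id, e.text
        FROM mentions AS m1
        JOIN mentions AS m2 ON m2.event_id = m1.event_id  -- entities co-mentioned with the seed
        JOIN mentions AS m3 ON m3.entity = m2.entity      -- other events sharing those entities
        JOIN events AS e ON e.id = m3.event_id
        WHERE m1.entity = ?
        ORDER BY e.id
        """,
        (seed_entity,),
    ).fetchall()
```

Because the grouping is computed by the join at query time, adding an event is just two `INSERT`s; there is no global graph to rebuild.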

RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation

RoTRAG is a retrieval-augmented framework for detecting harmful content in multi-turn dialogues that grounds its judgments in human-written moral norms called Rules of Thumb (RoTs). For each conversational turn it retrieves relevant RoTs to support reasoning and severity classification, and a binary routing classifier decides when retrieval is actually needed. Across benchmark datasets such as ProsocialDialog and Safety Reasoning Multi Turn Dialogue, it achieves an average 40% improvement in F1 and an 8.4% reduction in distributional error. The approach makes harm assessment more interpretable and efficient, which is useful for teams building robust conversational AI systems.

arXiv cs.AI · 41 d ago · found 30 d ago · #harm detection #conversation #retrieval-augmented generation

Understanding the Behaviors of Environment-aware Information Retrieval

This study systematically analyzes how large language models (LLMs) trained with reinforcement learning (RL) adapt their query formulation to different retrievers. It finds that each retriever has its own optimal query style, and that retrieval-augmented generation (RAG) performance improves with retriever-specific human guidance and with larger models. The authors also introduce a branching-based rollout technique that stabilizes training over trajectories with multiple retrieval steps, offering practical guidance for building retriever-aware RAG systems.

arXiv cs.CL · 41 d ago · found 30 d ago · #llm #retrieval #query-formulation

PathRouter: Aligning Rewards with Retrieval Quality in Agentic Graph Retrieval-Augmented Generation

PathRouter is a training framework for agentic Graph Retrieval-Augmented Generation (GraphRAG) that tackles two reinforcement-learning problems: reward aliasing and search-update ambiguity. It scores each trajectory on both answer correctness and overlap with the evidence path, which discourages shortcuts and encourages evidence-seeking behavior. Across six QA benchmarks, PathRouter improves answer F1 by an average of 3.1 for 3B models and 4.9 for 7B models, showing that rewarding retrieval quality, not only final answers, pays off when training retrieval agents.

arXiv cs.CL · 41 d ago · found 30 d ago · #retrieval #graph #llm
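PathRouter scores trajectories on both answer correctness and evidence-path overlap. One way to express such a combined reward is sketched below, using token-level F1 for the answer and Jaccard overlap between visited and reference evidence paths. The overlap measure and the weight `alpha` are assumptions, not the paper's formula.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    pred, ref = prediction.lower().split(), gold.lower().split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def path_overlap(retrieved_path, gold_path) -> float:
    """Assumption: Jaccard overlap between visited graph nodes and the reference evidence path."""
    r, g = set(retrieved_path), set(gold_path)
    return len(r & g) / len(r | g) if r | g else 0.0


def trajectory_reward(answer, gold_answer, retrieved_path, gold_path, alpha=0.5):
    # A correct answer reached through a shortcut (low path overlap) earns less
    # than one grounded in the reference evidence.
    return alpha * token_f1(answer, gold_answer) + (1 - alpha) * path_overlap(retrieved_path, gold_path)
```

Under an answer-only reward, both trajectories in a shortcut-versus-grounded comparison would score 1.0; the path term is what separates them.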

TechRAG: Evidence-Gated Multimodal Agentic RAG for Technical Literature Reasoning

The paper introduces TechRAG, an evidence-gated multimodal retrieval-augmented generation (RAG) framework for reasoning over technical literature in fields such as intelligent tires and vehicle dynamics. Its architecture combines hybrid text retrieval (dense vector search with FAISS plus BM25 keyword matching), a Neo4j knowledge graph for evidence expansion, and a multi-agent system with self-correcting revision. By integrating text, visual, and graph evidence, the framework aims to make literature reasoning in specialized domains more accurate and thorough.

arXiv cs.AI · 41 d ago · found 30 d ago · #rag #multimodal #literature
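TechRAG's hybrid text retrieval merges dense (FAISS) and lexical (BM25) results, but the summary does not say how the two ranked lists are combined. Reciprocal rank fusion (RRF) is a common choice and is used here purely as an illustration; the constant `k=60` is an assumption, and the input lists stand in for FAISS and BM25 outputs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_n=None):
    """Fuse ranked lists of doc IDs; each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # sort by fused score, breaking ties by ID so the output is deterministic
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n] if top_n is not None else fused
```

RRF only uses ranks, so it sidesteps the fact that FAISS distances and BM25 scores live on incomparable scales; a document ranked well by both retrievers rises to the top.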

ScoreGate: Adaptive Chunk Selection for Retrieval-Augmented Generation via Dual-Score Statistical Fusion

ScoreGate is an adaptive chunk-selection mechanism for retrieval-augmented generation. At inference time it fuses the bi-encoder similarity and cross-encoder reranker scores that the pipeline already produces, so it needs no additional model inference. On MS MARCO it reaches an MRR@10 of 0.401 while retaining 35% fewer chunks, and it maintains high recall with zero false positives. This lets practitioners match the amount of retrieved context to query complexity, reducing latency and token usage.

arXiv cs.CL · 42 d ago · found 34 d ago · #retrieval-augmented generation #chunk selection #llm
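ScoreGate fuses bi-encoder and cross-encoder scores that a retrieve-then-rerank pipeline already computes. The summary does not give the exact statistics, so this sketch assumes one plausible design: z-normalize each score list per query, average the two, and keep chunks whose fused score exceeds a threshold, always keeping at least one. The threshold and `min_keep` are hypothetical parameters.

```python
import statistics


def zscores(values):
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [0.0 if sd == 0 else (v - mu) / sd for v in values]


def select_chunks(chunks, bi_scores, ce_scores, threshold=0.0, min_keep=1):
    """Assumption: keep chunks whose averaged z-scored evidence is above `threshold`."""
    fused = [(a + b) / 2 for a, b in zip(zscores(bi_scores), zscores(ce_scores))]
    order = sorted(range(len(chunks)), key=lambda i: fused[i], reverse=True)
    kept = [i for i in order if fused[i] > threshold]
    if len(kept) < min_keep:
        kept = order[:min_keep]  # never hand the generator an empty context
    return [chunks[i] for i in kept]
```

Normalizing per query puts cosine similarities and unbounded reranker logits on a common footing, and it makes the number of kept chunks vary with how sharply the scores separate for that query.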

Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA

The article presents Rabtriever, a rationale-based retrieval model trained by on-policy distillation from LLM-based generative rerankers to avoid the computational cost of cross-encoding. Built on the Joint-Embedding Predictive Architecture (JEPA), Rabtriever encodes queries and documents independently yet achieves performance comparable to the generative rerankers, reducing retrieval complexity from quadratic to linear. It supports applications such as empathetic conversation and robotic manipulation while maintaining strong performance on traditional benchmarks such as MS MARCO and BEIR.

arXiv cs.CL · 42 d ago · found 34 d ago · #retrieval #llm #distillation #generative

Hyperdimensional computing for structured querying on tabular data embeddings

The paper applies Hyperdimensional Computing (HDC), specifically Holographic Reduced Representations (HRR), to structured querying over tabular data embeddings. Unlike existing embedding methods, it provides interpretable similarity scores and principled retrieval thresholds, which are essential for tasks such as zero-match detection (recognizing that no row satisfies a query). In evaluation, HDC outperforms the graph-based baseline EmbDI on row retrieval across configurations, is especially strong on non-equality predicates, and achieves high accuracy on attribute projection.

arXiv cs.AI · 42 d ago · found 35 d ago · #tabular data #embeddings #querying #hyperdimensional computing
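Holographic Reduced Representations encode a table row as a sum of role-filler bindings, where binding is circular convolution; an attribute is read back by circular correlation with its role vector, then cleaned up against the known values. The sketch below shows that mechanism on one toy row. The dimension, the random codebook, and the attribute names are assumptions, and the paper's handling of predicates and thresholds is not reproduced.

```python
import math
import random

DIM = 512  # assumption: vector dimension for the toy example


def rand_vec(rng):
    # i.i.d. N(0, 1/DIM) entries, the usual HRR initialization
    return [rng.gauss(0.0, 1.0 / math.sqrt(DIM)) for _ in range(DIM)]


def bind(a, b):
    """Circular convolution: (a * b)[k] = sum_j a[j] * b[(k - j) mod n]. O(n^2), fine for a demo."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]


def unbind(c, a):
    """Circular correlation: bind with a's approximate inverse a[-j mod n]."""
    n = len(a)
    a_inv = [a[-j % n] for j in range(n)]
    return bind(c, a_inv)


def bundle(*vecs):
    return [sum(xs) for xs in zip(*vecs)]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))


def cleanup(noisy, codebook):
    """Return the codebook symbol most similar to a noisy decoded vector."""
    return max(codebook, key=lambda name: cosine(noisy, codebook[name]))
```

The cosine returned during clean-up is the kind of interpretable score the summary refers to: a decoded value that is not close to any codebook entry signals that the row holds no known match.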

UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities

UniversalRAG is a Retrieval-Augmented Generation framework that retrieves and integrates knowledge from corpora of diverse modalities and granularities, whereas most existing methods target a single modality. A modality-aware routing mechanism dynamically selects the most relevant corpus for each query, which mitigates the modality gap between corpora and improves response accuracy. Across 10 benchmarks, it outperforms both modality-specific and unified baselines, making it a versatile foundation for multimodal AI systems.

arXiv cs.AI · 42 d ago · found 34 d ago · #rag #retrieval #multimodal

Verbatim Chunks Beat Extracted Artifacts: A Controlled Ablation of Memory Representations for Long LLM Conversations

The study presents a controlled ablation that compares verbatim conversation chunks with LLM-extracted structured artifacts as the memory representation in conversational-memory systems. Verbatim chunks significantly outperform extracted artifacts, scoring 43.9% vs. 28.0% on LoCoMo and 67.4% vs. 45.4% on LongMemEval-S. Retaining verbatim details therefore appears crucial for retrieval accuracy, and structured memory should complement raw text in LLM applications rather than replace it.

arXiv cs.AI · 42 d ago · found 35 d ago · #memory-systems #conversational-ai #llm

ADORE: Iterative Query Expansion with Retrieval-Grounded Relevance Feedback

ADORE (ADapt, Observe, Relevance Evaluate) is an iterative query-expansion framework driven by retrieval-grounded relevance feedback. It combines LLM-generated pseudo-passages, the documents the corpus actually returns, and relevance assessments, which mitigates problems such as retrieval drift and misleading vocabulary in the expanded queries. It reports a 24.5% increase in average nDCG@10 over BM25 on the TREC Deep Learning benchmark and a 122.9% improvement on the BRIGHT dataset.

arXiv cs.CL · 42 d ago · found 34 d ago · #query expansion #retrieval #llm #feedback
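ADORE's gains are reported in nDCG@10, which rewards placing highly relevant documents near the top of the first ten results. A minimal implementation using the linear-gain formulation follows; passing graded judgments as a dict from doc ID to relevance is an assumption about the input format, and some toolkits use the exponential gain 2^rel - 1 instead.

```python
import math


def dcg(gains):
    # the result at 0-based position i is discounted by log2(i + 2)
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids, qrels, k=10):
    """ranked_ids: retrieved doc IDs in order; qrels: {doc_id: graded relevance}."""
    gains = [qrels.get(d, 0) for d in ranked_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Dividing by the ideal DCG makes scores comparable across queries with different numbers of relevant documents, which is why averages like the ones ADORE reports are meaningful.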

TA-RAG: Tone-Aware Retrieval-Augmented Generation for Peer-Support Health Communication

The paper introduces TA-RAG, a prompt-based, tone-aware retrieval-augmented generation framework for peer-support health communication in sensitive contexts such as HIV support. TA-RAG adds explicit tone controls (stigma-free rewriting, readability adjustment, recipient adaptation, and empathy rephrasing) without any model fine-tuning. Evaluations show that these components significantly improve communication quality while preserving content integrity, pointing to prompt-based tone management as a viable option for health-related RAG systems.

arXiv cs.CL · 42 d ago · found 34 d ago · #llm #retrieval-augmented-generation #health-communication

Succeeding at Scale: Enterprise Retrieval Benchmark Construction and Index-Preserving Query Adaptation for Multi-Tenant Search

The article introduces DevRev-Search, a passage-retrieval benchmark for technical customer support that targets the domain-adaptation challenges of large-scale multi-tenant retrieval systems. Candidate passages are generated by a fully automated pipeline that combines sparse and dense retrievers, and an LLM-as-a-Judge labels their relevance. The study also presents index-preserving, query-only adaptation strategies: fine-tuning only the query encoder leaves the existing document index untouched yet significantly improves retrieval quality, which keeps the approach efficient for scalable enterprise retrieval.

arXiv cs.AI · 42 d ago · found 34 d ago · #retrieval #benchmark #query

AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models

AudioDER is a new reasoning-oriented dataset for post-training Large Audio-Language Models (LALMs), built to strengthen audio reasoning by reducing the redundancy found in existing datasets. It contains approximately 191,000 samples made up of audio clips, multiple-choice questions with answer candidates, captions, and chain-of-thought rationales, and a deduplication pipeline is used to improve corpus diversity. In experiments, post-training Qwen2-Audio-7B-Instruct on AudioDER significantly improves its performance across several audio reasoning benchmarks, making the dataset a promising resource for advancing audio reasoning in LALMs.

arXiv cs.AI · 42 d ago · found 35 d ago · #audio-language models #dataset #reasoning
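The summary does not describe how AudioDER's deduplication pipeline works, so as a generic illustration only, here is near-duplicate filtering of question text using word-shingle Jaccard similarity. The shingle size and the 0.8 threshold are assumptions, and a pipeline for audio data may rely on other signals as well.

```python
import re


def shingles(text, n=3):
    """Set of n-word shingles from lowercased text, ignoring punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0


def dedupe(samples, threshold=0.8, n=3):
    """Greedily keep a sample only if it is not a near-duplicate of one already kept."""
    kept, kept_shingles = [], []
    for s in samples:
        sh = shingles(s, n)
        if all(jaccard(sh, k) < threshold for k in kept_shingles):
            kept.append(s)
            kept_shingles.append(sh)
    return kept
```

This pairwise check is quadratic in the number of samples; at the scale of a corpus with 191,000 samples, a MinHash with locality-sensitive hashing index would typically replace the inner loop.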

Fable 5 data, including CoT

The Fable 5 dataset, consisting of 953 traces with Chain-of-Thought (CoT) data, has been released on Hugging Face. It is expected to support fine-tuning models for complex reasoning tasks, and the traces may help advance LLM training methods aimed at better interpretability and reasoning.

Reddit r/LocalLLaMA · 44 d ago · found 38 d ago · #fable5 #data #cot

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

LEDGER, a dataset of 4,999 digitized corporate annual reports containing figures, tables, and narrative text, has been released for rigorous evaluation of long-context capabilities in financial reporting. It covers 31 labeled financial KPIs and provides three evaluation benchmarks (page-level KPI retrieval, conversational single-value lookup, and full KPI extraction), all supported by human OCR-quality annotations. The resource lets practitioners develop and benchmark models for grounded financial retrieval and extraction over long, numerically dense documents.

arXiv cs.CL · 45 d ago · found 38 d ago · #financial-retrieval #benchmark #corporate-reports

Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings

Fin-RATE is a new benchmark for evaluating Large Language Models (LLMs) on U.S. SEC filings, designed to give a more comprehensive picture of LLM performance in financial analysis. It simulates financial analysts' workflows through three evaluation pathways: detail-oriented reasoning, cross-entity comparison, and longitudinal tracking. Benchmarking 17 LLMs revealed performance drops of up to 18.60% as task complexity increases and exposed issues such as hallucinations and mismatches that existing benchmarks do not adequately capture. The results help practitioners understand where LLMs fall short in real-world financial contexts.

arXiv cs.AI · 45 d ago · found 39 d ago · #llm #financial analysis #benchmark

MiniPIC: Flexible Position-Independent Caching in <100LOC

The article introduces Minimalistic Position-Independent Caching (MiniPIC), a lightweight design for vLLM that speeds up retrieval-augmented workloads with a positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC needs fewer than 100 lines of code changes, supports multiple caching methods such as Block-Attention and Prompt Cache, and improves prefill throughput by 49% on the 2WikiMultihopQA benchmark while substantially reducing time-to-first-token for cached spans. Practitioners can therefore adopt flexible caching strategies without extensive server modifications.

arXiv cs.AI · 45 d ago · found 39 d ago · #retrieval-augmented generation #caching #inference
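The key property of position-independent caching is that a chunk's cached state is looked up by its content alone, not by where it sits in the prompt, so a retrieved passage seen before can be reused in any position. The toy sketch below models only that bookkeeping. `fake_prefill` is a hypothetical stand-in for computing a chunk's KV cache, and nothing here reflects vLLM's actual interfaces or how MiniPIC removes positional encoding.

```python
import hashlib


class ChunkCache:
    """Maps chunk content to cached state, independent of the chunk's position in a prompt."""

    def __init__(self, prefill_fn):
        self.prefill_fn = prefill_fn
        self.store = {}
        self.prefilled_tokens = 0  # work actually done, counted in whitespace tokens for illustration

    def _key(self, chunk: str) -> str:
        return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

    def get_states(self, chunks):
        states = []
        for chunk in chunks:
            key = self._key(chunk)
            if key not in self.store:
                self.store[key] = self.prefill_fn(chunk)
                self.prefilled_tokens += len(chunk.split())
            states.append(self.store[key])
        return states


def fake_prefill(chunk):
    # Hypothetical placeholder for a position-free KV computation.
    return ("kv", chunk)
```

A prefix cache, by contrast, only hits when the reused text appears at the same position after an identical prefix; with retrieved passages arriving in varying orders, content-keyed reuse is what recovers the prefill savings.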

When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

The study introduces a controlled diagnostic framework that compares iterative Retrieval-Augmented Generation (RAG) with static RAG for multi-hop scientific question answering on the ChemKGMultiHopQA dataset. It benchmarks eleven state-of-the-art LLMs in three settings (No Context, Gold Context, and Iterative RAG) and finds that Iterative RAG can outperform Gold Context by up to 25.6 percentage points, with the largest benefit for models not fine-tuned for reasoning. The work underscores the value of staged retrieval and offers guidance for optimizing RAG systems in scientific applications.

arXiv cs.AI · 45 d ago · found 39 d ago · #rag #multi-hop #question answering
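What distinguishes iterative from static RAG is that each retrieval uses what the previous hop returned. A toy two-hop version over a dictionary of facts is sketched below. The fact store, the `(entity, relation)` lookup standing in for retrieval, and the hand-written relation plan standing in for the LLM's reasoning are all assumptions.

```python
def retrieve(store, entity, relation):
    """Stand-in retriever: look up one (entity, relation) fact, or None if absent."""
    return store.get((entity, relation))


def iterative_rag(store, start_entity, relation_plan):
    """Retrieve hop by hop, using each hop's answer as the next hop's query entity."""
    entity, trace = start_entity, []
    for relation in relation_plan:
        fact = retrieve(store, entity, relation)
        if fact is None:
            return None, trace  # evidence missing: stop rather than guess
        trace.append((entity, relation, fact))
        entity = fact
    return entity, trace
```

A static pipeline would issue only the first lookup with the original entity and never retrieve the fact about the intermediate compound needed for the second hop; the staged loop surfaces that bridge fact explicitly.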

When Does Mixing Help? Analyzing Query Embedding Interpolation in Multilingual Dense Retrieval

The study systematically evaluates query embedding interpolation for multilingual dense retrieval on the mMARCO dataset. Mixing a query's embedding with those of its parallel translations at a well-chosen ratio improves retrieval, outperforming monolingual queries in 88 of 105 cases, especially when retrieving from non-English document indices. English proves to be the most effective mixing partner, and sensitivity to language mixing is predictable, which helps practitioners tune multilingual retrieval systems.

arXiv cs.CL · 45 d ago · found 38 d ago · #multilingual #dense-retrieval #embedding
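Query embedding interpolation mixes the embedding of the original query with that of a parallel translation, q = alpha * e(source) + (1 - alpha) * e(translation), and re-normalizes before scoring. The sketch below uses toy 3-dimensional vectors in place of a real multilingual encoder; `alpha` plays the role of the mixing ratio the study tunes, and its default here is an arbitrary assumption.

```python
import math


def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def mix_query(src_emb, translated_emb, alpha=0.5):
    """alpha=1 keeps only the source-language query; alpha=0 uses only the translation."""
    mixed = [alpha * a + (1 - alpha) * b for a, b in zip(src_emb, translated_emb)]
    return l2_normalize(mixed)


def rank(query_emb, doc_embs):
    """Rank documents by cosine similarity to the (already normalized) query."""
    scores = [sum(q * d for q, d in zip(query_emb, l2_normalize(doc))) for doc in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)
```

Re-normalizing after mixing keeps cosine scores comparable across different `alpha` values, so sweeping the ratio isolates the effect of the mix itself.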