RAG
When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation
The paper introduces Streaming Retrieval-Augmented Generation (Streaming RAG), which issues tool queries in parallel with user input to reduce perceived latency. It characterizes tool-intent stabilization, a measure of when speculative queries converge on relevant results, and establishes a model-agnostic bound on tool latency savings as a function of the user's input rate. Under the reported conditions (600 ms latency, 3 words/sec input), 73.9% of queries can hide a significant portion of that latency, which gives AI practitioners guidance on query timing and tool integration in real-time applications.
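To make the relationship between input rate and hidden latency concrete, here is a minimal sketch. It is not the paper's bound. It assumes the tool call is issued the moment the intent stabilizes, and that the user keeps entering words at a constant rate until the utterance ends. The function names and the sample `words_remaining` values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def hidden_latency_ms(words_remaining: int,
                      words_per_sec: float,
                      tool_latency_ms: float) -> float:
    """Latency hidden behind the rest of the user's input (illustrative model).

    words_remaining: words the user still has to enter after the speculative
    query has stabilized (hypothetical measurement, not from the paper).
    """
    if words_per_sec <= 0:
        raise ValueError("words_per_sec must be positive")
    # Time the tool gets "for free" while the user finishes the input.
    head_start_ms = 1000.0 * words_remaining / words_per_sec
    # The tool cannot hide more latency than it actually has.
    return min(tool_latency_ms, head_start_ms)


def fraction_fully_hidden(words_remaining_per_query: Sequence[int],
                          words_per_sec: float,
                          tool_latency_ms: float) -> float:
    """Share of queries whose tool latency is completely hidden."""
    if not words_remaining_per_query:
        return 0.0
    fully_hidden = sum(
        1 for w in words_remaining_per_query
        if hidden_latency_ms(w, words_per_sec, tool_latency_ms) >= tool_latency_ms
    )
    return fully_hidden / len(words_remaining_per_query)


if __name__ == "__main__":
    # Made-up sample: words left after stabilization for four queries.
    sample = [0, 1, 2, 3]
    print(fraction_fully_hidden(sample, words_per_sec=3.0, tool_latency_ms=600.0))
```

Under this simplified model, at 3 words/sec each remaining word buys roughly 333 ms, so a 600 ms tool call is fully hidden only when the intent stabilizes at least two words before the end of the input.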
streaming, tool use, llm