RAG
Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings
The article introduces SAGA, a framework that improves vision encoders for retrieval by using frozen multimodal large language models (MLLMs) as a source of attribute-aware training signals. Using Group Relative Policy Optimization (GRPO), SAGA replaces traditional scalar supervision with gradients that target specific visual attributes, which improves embedding quality. The framework reports a 3 to 6 point increase in Recall@1 across several benchmark datasets, which makes it relevant to practitioners working on zero-shot image retrieval.
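For readers unfamiliar with the headline metric, the following is a minimal sketch of how Recall@1 is commonly computed for embedding-based retrieval: each query is matched to its single nearest gallery item, and the score is the fraction of queries whose match shares the query's label. The function name `recall_at_1`, the use of cosine similarity, and the toy data are assumptions made for illustration; the article's own evaluation protocol is not described in this summary.

```python
import math


def recall_at_1(query_embs, gallery_embs, query_labels, gallery_labels):
    """Fraction of queries whose nearest gallery item shares the query's label.

    Nearest is defined by cosine similarity (an assumption for this sketch).
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm > 0 else list(v)

    gallery = [normalize(g) for g in gallery_embs]
    hits = 0
    for q, q_label in zip(query_embs, query_labels):
        q = normalize(q)
        # Cosine similarity reduces to a dot product on unit-length vectors.
        sims = [sum(a * b for a, b in zip(q, g)) for g in gallery]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(gallery_labels[best] == q_label)
    return hits / len(query_embs)


# Toy data (hypothetical): two gallery items, three queries.
gallery_embs = [[0.9, 0.1], [0.1, 0.9]]
gallery_labels = ["a", "b"]
query_embs = [[1.0, 0.0], [0.0, 1.0], [0.2, 1.0]]
query_labels = ["a", "b", "a"]  # the third query is retrieved incorrectly

score = recall_at_1(query_embs, gallery_embs, query_labels, gallery_labels)
print(f"Recall@1 = {score:.3f}")
```

On this 0 to 1 scale, a 3 point gain corresponds to an absolute increase of 0.03, for example from 0.60 to 0.63.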
visual embeddings, attribute gradients, mllm