RAG
Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time
The paper introduces a retrieval-augmented vision-language-action (VLA) policy that adapts to new tasks at test time without per-task fine-tuning. The policy is trained on paired demonstrations, and at test time a retrieval mechanism supplies additional task-specific demonstrations. The method enables efficient cross-embodiment generalization and, in particular, improves performance in the Cosmos Policy framework. Because no retraining is needed for each new task, the computational cost of adaptation drops, which matters for practitioners building scalable, flexible AI systems in robotics and related fields.
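The summary does not say how demonstrations are retrieved, so the following is only a minimal sketch of one common approach: nearest-neighbor lookup by cosine similarity over demonstration embeddings. The names `Demo`, `retrieve`, and `bank`, the toy 3-dimensional embeddings, and the choice of cosine similarity are all assumptions made for illustration, not details taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Demo:
    """A stored demonstration (hypothetical container, not from the paper)."""
    task: str
    embedding: Sequence[float]  # assumed to come from some frozen encoder


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: Sequence[float], bank: List[Demo], k: int = 2) -> List[Demo]:
    """Return the k demonstrations most similar to the query embedding."""
    ranked = sorted(bank, key=lambda d: cosine(query, d.embedding), reverse=True)
    return ranked[:k]


# Toy demonstration bank with made-up embeddings (assumption for illustration).
bank = [
    Demo("open drawer", [1.0, 0.0, 0.0]),
    Demo("close drawer", [0.9, 0.1, 0.0]),
    Demo("pick up cup", [0.0, 1.0, 0.0]),
]

# Embedding of the new task's observation or instruction (also made up).
query = [1.0, 0.0, 0.1]

# The retrieved demonstrations would be passed to the policy as extra context,
# so no weights are updated when a new task arrives.
context = retrieve(query, bank, k=2)
```

In a design like this, supporting a new task means adding demonstrations to `bank` rather than running a fine-tuning job, which is the trade the summary describes.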
retrieval, vision-language, policy