Multimodal
Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting
The article introduces Visual Attentive Prompting (VAP), a training-free perceptual adapter that gives Vision-Language-Action (VLA) models top-down selective attention so they can follow personalized commands. VAP treats reference images as a non-parametric visual memory and grounds user-specific objects through open-vocabulary detection. The article reports significantly improved performance on personalized manipulation tasks, evaluated on two new benchmarks, Personalized-SIMPLER and Personalized-VLABench. For practitioners, this means robotic systems that can more reliably identify and manipulate a particular user's objects in real-world scenarios.
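To make the idea of a non-parametric visual memory concrete, here is a minimal Python sketch of how a personal object might be grounded among detector candidates. It is an illustration under stated assumptions, not the article's implementation: the summary does not say how VAP matches detections to reference images, so the embedding comparison, the function name `ground_personal_object`, and the `min_similarity` threshold are all hypothetical. The sketch assumes that some open-vocabulary detector has already produced candidate boxes and that some image encoder has already turned the reference images and the candidate crops into vectors.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ground_personal_object(
    reference_embeddings: List[Vector],
    candidates: List[Tuple[Box, Vector]],
    min_similarity: float = 0.5,  # hypothetical threshold, not from the article
) -> Optional[Box]:
    """Return the candidate box that best matches any reference embedding.

    reference_embeddings: the "visual memory", one vector per reference image
        (assumed to come from an image encoder that is not shown here).
    candidates: (box, embedding) pairs, assumed to come from an
        open-vocabulary detector plus the same encoder.
    Returns None when no candidate scores above min_similarity.
    """
    best_box: Optional[Box] = None
    best_score = min_similarity
    for box, embedding in candidates:
        # Score a candidate by its closest reference; no weights are trained.
        score = max(
            (cosine(embedding, ref) for ref in reference_embeddings),
            default=0.0,
        )
        if score > best_score:
            best_box, best_score = box, score
    return best_box


# Toy usage with hand-written 3-d vectors standing in for real embeddings.
memory = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
detections = [
    ((0.0, 0.0, 10.0, 10.0), [0.0, 1.0, 0.0]),     # some other cup
    ((20.0, 20.0, 40.0, 40.0), [1.0, 0.05, 0.0]),  # resembles the references
]
target_box = ground_personal_object(memory, detections)
```

Because the memory is just a list of vectors, personalizing for a new object in this sketch means appending embeddings rather than retraining anything, which is the sense in which such an approach is training-free. How the selected region is then passed to the VLA model is outside the scope of this sketch.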
vision_language_action, personalization, robotics