Multimodal · arXiv cs.AI · 15 d ago

Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting

The paper introduces Visual Attentive Prompting (VAP), a training-free perceptual adapter that gives Vision-Language-Action (VLA) models top-down selective attention so they can follow personalized commands such as "bring my cup". VAP treats reference images of the user's object as a non-parametric visual memory and grounds that object in the scene through open-vocabulary detection. It is reported to significantly improve performance on personalized manipulation tasks, measured on two new benchmarks, Personalized-SIMPLER and Personalized-VLABench. For practitioners, the relevance is that a robot can identify and manipulate one specific object among similar ones in real-world scenes, which makes VLA systems more usable in personalized applications.
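
As a rough illustration of the grounding idea in the summary, the Python sketch below matches detector proposals against a small set of reference-image embeddings and keeps the best match. It is not the paper's implementation: the function name `ground_personal_object`, the cosine-similarity matching rule, the `min_similarity` threshold, and the toy 2-D embeddings are all assumptions made for illustration. A real system would take its boxes from an open-vocabulary detector and its embeddings from an image encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ground_personal_object(reference_embeddings, candidates, min_similarity=0.5):
    """Pick the candidate box that best matches any reference embedding.

    reference_embeddings: list of vectors, one per reference image
        (the "visual memory"; nothing is trained or fine-tuned).
    candidates: list of (box, embedding) pairs from a detector (hypothetical input).
    Returns (box, score), or None if nothing clears min_similarity.
    """
    if not reference_embeddings:
        return None
    best_box, best_score = None, -1.0
    for box, embedding in candidates:
        # Score a candidate by its closest reference image.
        score = max(cosine(embedding, ref) for ref in reference_embeddings)
        if score > best_score:
            best_box, best_score = box, score
    if best_box is None or best_score < min_similarity:
        return None
    return best_box, best_score


# Toy data: two reference views of "my cup" and two detected cups in the scene.
references = [[1.0, 0.0], [0.9, 0.1]]
detections = [
    ((0, 0, 10, 10), [0.0, 1.0]),      # a different cup
    ((20, 20, 40, 40), [1.0, 0.05]),   # close to the reference views
]
match = ground_personal_object(references, detections)
```

Because the memory is just a list of embeddings, adding a new personal object in this sketch means appending reference vectors rather than retraining anything, which is the sense in which such a lookup is non-parametric.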

vision_language_action · personalization · robotics · relevance 0.00 · engagement 0.00
