Multimodal
Action with Visual Primitives
The article presents AVP (Action with Visual Primitives), an end-to-end architecture for Vision-Language-Action (VLA) models that separates instruction comprehension from motor control by using visual-primitive tokens to condition an action expert. The authors report that this design improves learning efficiency and generalization, with a 37.04% improvement in success rate on robotic pick-and-place tasks over existing methods. The reported gains cover data efficiency, spatial-compositional generalization, and object-level transfer, which makes AVP relevant to practitioners working on robotic manipulation.
robotics, vision, action
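
The separation described in the summary can be illustrated with a toy sketch. This is not the AVP implementation: the names `VisualPrimitives`, `comprehend`, and `action_expert` are hypothetical, the "primitives" are assumed here to be a 2D pick point and a 2D place point, and both stages are hand-written stand-ins for what would be learned models. The only point it shows is the interface: the action stage never sees the instruction text, only the primitives.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class VisualPrimitives:
    """Hypothetical intermediate representation: where to pick, where to place."""
    pick: Point
    place: Point


def comprehend(instruction: str, scene: Dict[str, Point]) -> VisualPrimitives:
    """Toy stand-in for instruction comprehension.

    Assumption: the instruction mentions exactly two known objects,
    first the object to move, then the target location.
    """
    words = instruction.lower().replace(".", "").split()
    mentioned = [w for w in words if w in scene]
    if len(mentioned) != 2:
        raise ValueError("expected exactly two known objects in the instruction")
    source, target = mentioned
    return VisualPrimitives(pick=scene[source], place=scene[target])


def action_expert(primitives: VisualPrimitives, gripper: Point, steps: int = 4) -> List[Point]:
    """Toy stand-in for the action expert.

    It is conditioned only on the primitives and the gripper position,
    and returns linearly interpolated waypoints: gripper -> pick -> place.
    """
    def segment(a: Point, b: Point) -> List[Point]:
        out = []
        for t in range(1, steps + 1):
            s = t / steps
            out.append((a[0] * (1 - s) + b[0] * s, a[1] * (1 - s) + b[1] * s))
        return out

    return segment(gripper, primitives.pick) + segment(primitives.pick, primitives.place)


# Usage: object positions are made-up normalized image coordinates.
scene = {"cup": (0.2, 0.4), "plate": (0.8, 0.6)}
prims = comprehend("Put the cup on the plate", scene)
trajectory = action_expert(prims, gripper=(0.0, 0.0), steps=4)
```

Because the second stage depends only on `VisualPrimitives`, swapping the instruction wording or the object names changes nothing downstream as long as the same points are produced, which is one way to read the generalization claim in the summary.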