Training
FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies
FineVLA is a framework for fine-grained instruction alignment in Vision-Language-Action (VLA) models. It addresses a limitation of existing datasets, which provide only coarse, goal-level language. The framework includes FineVLA-Data, a dataset of 47,159 fine-grained trajectories from 972,247 trajectories across 85K tasks, and a benchmark of 500 videos with 1,030 VQA questions. Experiments show that fine-grained supervision raises success rates on robotic tasks, reaching up to 86.8% in simulation, and significantly improves control over execution factors such as pose and approach direction. These results highlight the importance of detailed execution instructions for effective robotic policy learning.