Multimodal
DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation
DeMaVLA is a Vision-Language-Action (VLA) foundation model for generalizable deformable manipulation, targeting the folding of clothing items across varied conditions. It pairs a Vision-Language Model (VLM) backbone with an action expert that generates continuous actions via flow matching, and it prunes transformer layers to improve efficiency. Pre-trained on 5,000 hours of dual-arm demonstrations and fine-tuned with a human-in-the-loop Data Aggregation pipeline, DeMaVLA achieves competitive performance on RoboTwin 2.0 and strong results on real-world household folding tasks, pointing to its potential for scalable robotic manipulation.
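To make the flow-matching idea concrete, the following is a minimal NumPy sketch of how continuous actions can be generated by integrating a velocity field from noise. It is not DeMaVLA's implementation: the linear interpolation path, the Euler integrator, the function names, the action-chunk shape of 4 steps by 14 dimensions, and the closed-form `toy_velocity` that stands in for a trained action expert are all assumptions made for illustration.

```python
import numpy as np


def flow_matching_target(x0, x1, t):
    """Training pair for a linear probability path (an assumed, common choice).

    x0 is a noise sample, x1 is a demonstrated action chunk, t is in [0, 1].
    Returns the interpolated point x_t and the velocity the network regresses to.
    """
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target


def sample_actions(velocity_fn, action_shape, num_steps=10, rng=None):
    """Generate an action chunk by Euler-integrating dx/dt = v(x, t) from t=0 to t=1."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = rng.standard_normal(action_shape)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x


# Hypothetical target: 4 future steps of a 14-dimensional dual-arm command.
TARGET_CHUNK = np.linspace(-1.0, 1.0, 4 * 14).reshape(4, 14)


def toy_velocity(x, t):
    """Stand-in for a trained action expert.

    When the data distribution is a single point, the exact velocity field of
    the linear path is (target - x) / (1 - t). A real model would instead be a
    network conditioned on images and language.
    """
    return (TARGET_CHUNK - x) / (1.0 - t)


actions = sample_actions(toy_velocity, TARGET_CHUNK.shape, num_steps=10)
```

In this toy setting the integration lands on `TARGET_CHUNK` because the velocity field is exact; with a learned field, the number of integration steps trades sampling cost against accuracy.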
vision-language, robotics