Multimodal · arXiv cs.AI — 15 d ago

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

DeMaVLA is a newly introduced Vision-Language-Action (VLA) foundation model for generalizable deformable manipulation, targeting the folding of clothing items across varied conditions. It pairs a Vision-Language Model (VLM) backbone with an action expert that generates continuous actions via flow matching, and it prunes transformer layers to improve efficiency. The model is pre-trained on 5,000 hours of dual-arm demonstrations and fine-tuned with a human-in-the-loop Data Aggregation pipeline. It reports competitive performance on RoboTwin 2.0 and strong results on real-world household folding tasks.
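
The summary does not spell out how the action expert applies flow matching, so the following is only a generic sketch of the technique, assuming the common straight-line (noise to action) probability path and plain Euler integration at inference time. The function names, the toy 3-dimensional action, and the `make_oracle` stand-in for a trained network are all hypothetical; a real action expert would also condition on image and language features, which are omitted here.

```python
import random


def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Regression target for the velocity field along the straight path."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(velocity_fn, noise, action, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(noise, action, t)
    pred = velocity_fn(x_t, t)
    target = target_velocity(noise, action)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def sample_action(velocity_fn, noise, steps=10):
    """Euler-integrate the velocity field from t=0 (noise) to t=1 (action)."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1 inside the loop
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle(action):
    """Hypothetical stand-in for a trained network: the exact velocity
    field that transports any point toward a single known action."""
    def velocity_fn(x, t):
        return [(a - xi) / (1.0 - t) for a, xi in zip(action, x)]
    return velocity_fn


if __name__ == "__main__":
    rng = random.Random(0)
    action = [0.5, -1.0, 2.0]                 # toy continuous action
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    oracle = make_oracle(action)
    print(flow_matching_loss(oracle, noise, action, t=0.3))
    print(sample_action(oracle, noise, steps=10))
```

In training, `flow_matching_loss` would be minimized over sampled `(noise, action, t)` triples; at inference, `sample_action` turns fresh noise into an action in a small, fixed number of steps, which is one reason flow matching is attractive for continuous control.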

vision-language · robotics
