Multimodal · arXiv cs.AI · 12 d ago

PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space

PearlVLA is a Vision-Language-Action (VLA) framework that improves action planning by deliberating in the latent space of a vision-language model. Its dual-branch architecture separates visual grounding from iterative plan refinement and uses a lightweight, frozen latent world model to optimize action generation at low latency. On the LIBERO benchmark, the reported results indicate state-of-the-art performance, which makes the approach relevant to practitioners who need efficient planning in embodied AI systems.
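To make the idea of refining a plan in latent space against a frozen world model concrete, here is a minimal toy sketch. It is not the PearlVLA implementation: the linear world model, the squared-distance cost, the gradient-step update, and every name below (`make_frozen_world_model`, `refine_plan`, and so on) are hypothetical choices made only for illustration, since the summary does not describe these details.

```python
import random


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Return the transpose of matrix m."""
    return [list(col) for col in zip(*m)]


def make_frozen_world_model(dim, seed=0):
    """Hypothetical stand-in for a frozen latent world model.

    Assumption: the model is linear, next_latent = W_state @ state + W_plan @ plan.
    The weights are fixed at creation and never updated ("frozen").
    """
    rng = random.Random(seed)
    w_state = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    w_plan = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    return w_state, w_plan


def predict(world_model, state, plan):
    """Predict the next latent state for a given latent plan."""
    w_state, w_plan = world_model
    return [a + b for a, b in zip(matvec(w_state, state), matvec(w_plan, plan))]


def plan_cost(world_model, state, plan, goal):
    """Assumed cost: squared distance between predicted and goal latents."""
    pred = predict(world_model, state, plan)
    return sum((p - g) ** 2 for p, g in zip(pred, goal))


def refine_plan(world_model, state, plan, goal, steps=20, lr=0.1):
    """Iteratively refine the latent plan; only the plan changes, not the model."""
    _, w_plan = world_model
    w_plan_t = transpose(w_plan)
    costs = [plan_cost(world_model, state, plan, goal)]
    for _ in range(steps):
        pred = predict(world_model, state, plan)
        residual = [p - g for p, g in zip(pred, goal)]
        # Gradient of the squared-distance cost with respect to the plan.
        grad = [2.0 * x for x in matvec(w_plan_t, residual)]
        plan = [z - lr * g for z, g in zip(plan, grad)]
        costs.append(plan_cost(world_model, state, plan, goal))
    return plan, costs


if __name__ == "__main__":
    model = make_frozen_world_model(dim=4)
    state = [1.0, 0.0, 0.0, 0.0]    # toy latent from the grounding branch
    goal = [0.5, -0.5, 0.2, 0.1]    # toy target latent
    plan, costs = refine_plan(model, state, [0.0] * 4, goal)
    print(f"cost before: {costs[0]:.4f}, after: {costs[-1]:.4f}")
```

In this sketch the refinement loop never touches the world model's weights, which is one plausible reading of why a small frozen model could keep deliberation cheap: each step is a forward prediction plus an update to the plan alone.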

vision-language · action-planning
