Multimodal
PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space
PearlVLA is a newly proposed Vision-Language-Action (VLA) framework that improves action planning by deliberating in the latent space of a vision-language model. Its dual-branch architecture separates visual grounding from iterative plan refinement, and it uses a lightweight, frozen latent world model to optimize action generation at low latency. Reported results on the LIBERO benchmark indicate state-of-the-art performance, which makes the approach relevant to practitioners who need efficient planning in embodied systems.
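To give a rough feel for what iterative plan refinement against a frozen latent world model can look like, here is a toy sketch in plain Python. It is not PearlVLA's implementation. The linear dynamics, the `DECAY` constant, the squared-error cost, the gradient-descent update, and every function name are assumptions made purely for illustration.

```python
from typing import List

Vector = List[float]

# Hypothetical frozen world-model parameter: it is never updated during refinement.
DECAY = 0.9


def world_model_step(state: Vector, action: Vector) -> Vector:
    """Toy latent dynamics (an assumption): next = DECAY * state + action."""
    return [DECAY * s + a for s, a in zip(state, action)]


def rollout(state: Vector, plan: List[Vector]) -> Vector:
    """Predict the final latent state reached by applying every action in the plan."""
    for action in plan:
        state = world_model_step(state, action)
    return state


def plan_cost(state: Vector, plan: List[Vector], goal: Vector) -> float:
    """Squared distance between the predicted final latent and the goal latent."""
    final = rollout(state, plan)
    return sum((f - g) ** 2 for f, g in zip(final, goal))


def refine_plan(
    state: Vector,
    plan: List[Vector],
    goal: Vector,
    steps: int = 50,
    lr: float = 0.1,
) -> List[Vector]:
    """Refine the latent plan by gradient descent; only the plan changes."""
    horizon = len(plan)
    plan = [list(action) for action in plan]  # copy so the caller's plan is untouched
    for _ in range(steps):
        final = rollout(state, plan)
        error = [f - g for f, g in zip(final, goal)]
        for t in range(horizon):
            # For these linear dynamics, d(cost)/d(action_t) = 2 * DECAY**(H-1-t) * error.
            scale = 2.0 * DECAY ** (horizon - 1 - t)
            plan[t] = [a - lr * scale * e for a, e in zip(plan[t], error)]
    return plan


if __name__ == "__main__":
    start = [1.0, -1.0]
    target = [0.5, 2.0]
    draft = [[0.0, 0.0] for _ in range(3)]
    better = refine_plan(start, draft, target)
    print("cost before:", plan_cost(start, draft, target))
    print("cost after: ", plan_cost(start, better, target))
```

The point of the sketch is the division of labor: the world model stays fixed and cheap to evaluate, while the plan is the only thing updated across refinement steps. A real system would replace the toy dynamics with a learned model and would likely use a different update rule.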
vision-language, action-planning