Multimodal · arXiv cs.CL · 11 d ago

Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models

This study introduces the first systematic evaluation of multilingual capabilities in Vision-Language-Action (VLA) models. It reveals a significant performance drop when models trained primarily on English instructions are tested on non-English inputs. The authors extend existing benchmarks with translated instructions and analyze cross-lingual transfer, finding that both instruction understanding and action execution degrade outside English. To close this gap, they propose a fine-tuning method, Multilingual Principal Component Alignment, which aligns representations across languages to improve performance in each of them. The method gives practitioners a practical route to more robust multilingual VLA systems.

vision-language · multilingual · models
