Multimodal
X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining
X-Tokenizer is a multimodal action tokenizer that gives Vision-Language-Action (VLA) models a semantic interface for robot control. Its lightweight architecture places a Semantic Residual Quantization (SRQ) module between an encoder and a decoder: the first quantization level uses Masked Action Modeling (MAM) to capture coarse motion intent, while deeper levels handle fine-grained reconstruction. Pretrained on 2.4 million trajectories, X-Tokenizer reports stronger multimodal grounding and long-horizon task performance, which suggests that action tokenization can serve VLA models as more than a compression step.
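The coarse-to-fine behavior of residual quantization can be illustrated with a minimal sketch. This is plain residual vector quantization over hand-written codebooks, not the paper's SRQ: it omits the learned encoder and decoder and the MAM objective on the first level, and every name and value in it (`nearest`, `residual_quantize`, `decode`, the codebooks) is hypothetical.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for i, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx


def residual_quantize(vec, codebooks):
    """Quantize vec level by level; each level encodes what the previous levels missed."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen code so the next level sees only the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def decode(indices, codebooks):
    """Reconstruct a vector by summing the selected code from each level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-level codebooks for a 2-D "action" vector.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],      # level 1: coarse
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0]],  # level 2: fine
]

tokens = residual_quantize([1.2, 0.1], codebooks)
coarse = decode(tokens[:1], codebooks[:1])  # first token only
full = decode(tokens, codebooks)            # all tokens
```

Decoding only the first token yields a coarse approximation of the input, and adding the second token refines it. In the described design, the first level would additionally be shaped by MAM so that its tokens carry motion intent rather than just the nearest coarse code.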
vla, tokenization, robotics