Training · arXiv cs.AI · 15 d ago

Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think

The paper introduces a structural compression pipeline for Vision-Language-Action (VLA) models that reduces model depth by up to 50% without loading the full-scale model. It uses Centered Kernel Alignment (CKA) to identify and remove redundant layers, which cuts training time by 40-50% and speeds up inference by up to 30% while maintaining or improving performance on the LIBERO, RoboCasa, and SimplerEnv benchmarks. The result suggests that VLA models can be made substantially cheaper to fine-tune and run, which makes them more practical for real-time robotic applications.
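To make the layer-selection idea concrete, here is a minimal sketch of linear Centered Kernel Alignment in Python with NumPy. It is not the paper's pipeline, whose details this summary does not give. It assumes that a layer is a pruning candidate when its output is highly similar to its input, and the helper names (`linear_cka`, `rank_layers_by_redundancy`) are hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (n_samples, dim).

    Returns a similarity in [0, 1]; it is invariant to orthogonal
    transforms and isotropic scaling of either representation.
    """
    # Center each feature over the sample axis.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def rank_layers_by_redundancy(activations):
    """Rank layers by how little they change the representation.

    activations[i] is the input to layer i and activations[i + 1] its
    output (an assumed layout). Returns (order, scores), where order
    lists layer indices from most to least redundant.
    """
    scores = [
        linear_cka(activations[i], activations[i + 1])
        for i in range(len(activations) - 1)
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


if __name__ == "__main__":
    # Synthetic stand-in for hidden states: layer 0 barely changes its
    # input, layer 1 replaces it with unrelated features.
    rng = np.random.default_rng(0)
    h0 = rng.normal(size=(64, 8))
    h1 = h0 + 1e-3 * rng.normal(size=(64, 8))
    h2 = rng.normal(size=(64, 8))
    order, scores = rank_layers_by_redundancy([h0, h1, h2])
    print("most redundant first:", order)
    print("CKA per layer:", [round(s, 3) for s in scores])
```

In practice the activations would come from running a small calibration batch through the model and recording each layer's hidden states. How many of the top-ranked layers can be dropped safely is an empirical question that the benchmarks above are meant to answer.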

fine-tuning · vla · compression
