Multimodal
OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation
The article introduces DRIVE-CHOREO, an LLM-choreographed multi-agent world model for multi-view driving video generation. It addresses two challenges: heterogeneous control injection and post-hoc cross-view fusion. Three Qwen2.5-VL agents produce a unified symbolic interlingua that aligns language, geometry, and pixels. The model reports state-of-the-art results on the nuScenes benchmark, both on multi-view consistency and with a BEV mAP of 21.6. The generated video also helps downstream perception: a detector trained on its synthetic data gains +2.4 NDS.
driving, video generation, world model