Research · arXiv cs.AI · 7 d ago

Diffusion Transformer World-Action Model for AV Scene Prediction

The paper presents a compact latent world model built on a Diffusion Transformer (DiT) for action-conditioned scene prediction in autonomous vehicles. Conditioned on ego-actions, it generates 256 × 256 frames up to 8 seconds ahead. With only 1.7 million parameters, the model reduces steering RMSE by 40% relative to the best single-frame encoder, and it outperforms traditional regression methods in perceptual quality, with a KID score of 0.078 versus 0.375 (lower is better). For practitioners, this means more accurate and realistic scene predictions without relying on real-world rollouts, which benefits planning and simulation in autonomous driving systems.
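To make the quoted numbers concrete, here is a minimal sketch of how a steering RMSE and a relative reduction could be computed. It assumes the 40% figure is a relative reduction against the baseline's RMSE. The helper names `rmse` and `relative_reduction` and the steering values are hypothetical and for illustration only; only the two KID values come from the summary.

```python
import math

def rmse(predicted, target):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))

def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline

# KID values quoted in the summary (lower is better).
kid_reduction = relative_reduction(0.375, 0.078)
print(f"KID reduction: {kid_reduction:.1%}")

# Hypothetical steering angles, NOT data from the paper.
target = [0.00, 0.10, 0.20, 0.10]
baseline_pred = [0.10, 0.20, 0.30, 0.20]  # each prediction is off by 0.10
model_pred = [0.06, 0.16, 0.26, 0.16]     # each prediction is off by 0.06

steer_reduction = relative_reduction(
    rmse(baseline_pred, target), rmse(model_pred, target)
)
print(f"Steering RMSE reduction: {steer_reduction:.1%}")
```

With these made-up errors the steering RMSE drops from about 0.10 to about 0.06, which is the kind of 40% relative reduction the summary describes; the same helper applied to the quoted KID scores gives a reduction of roughly 79%.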

autonomous vehicles · scene prediction · relevance 0.00 · engagement 0.00
