Inference
Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention
The paper introduces FAST-AR, a unified attention framework that reduces the memory footprint and latency of autoregressive video diffusion models at inference time. It combines three components: TempCache, which compresses the key-value cache; AnnCA, which accelerates cross-attention by selecting only frame-relevant tokens; and AnnSA, which sparsifies self-attention by matching each query to semantically relevant keys. Together these yield speedups of 5x to 10x while maintaining visual quality and keeping GPU memory usage stable, both of which matter for practitioners building long-form video synthesis and interactive applications.
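The idea behind sparsified self-attention can be illustrated with a minimal sketch, under stated assumptions. The summary does not say how AnnSA matches queries to keys, so the code below substitutes an exact top-k selection by scaled dot-product score; the function name `topk_sparse_attention` and the `top_k` parameter are hypothetical and not taken from the paper. It is written in plain Python for readability rather than speed.

```python
import math


def topk_sparse_attention(queries, keys, values, top_k):
    """Attend each query to only its top_k highest-scoring keys.

    queries: list of d-dimensional vectors
    keys:    list of d-dimensional vectors
    values:  list of vectors, one per key
    Illustrative stand-in only: exact top-k, not the paper's matching method.
    """
    d = len(queries[0])
    top_k = min(top_k, len(keys))
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
            for key in keys
        ]
        # Keep the indices of the top_k best-matching keys.
        keep = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:top_k]
        # Numerically stable softmax over the kept keys only.
        peak = max(scores[j] for j in keep)
        weights = [math.exp(scores[j] - peak) for j in keep]
        total = sum(weights)
        # Weighted sum of the corresponding value vectors.
        out = [0.0] * len(values[0])
        for w, j in zip(weights, keep):
            for c, vc in enumerate(values[j]):
                out[c] += (w / total) * vc
        outputs.append(out)
    return outputs
```

Note that this sketch still scores every key before discarding most of them, so it only shows which keys contribute to the output, not where a speedup would come from. A practical implementation would need a cheaper way to find the candidate keys; setting `top_k` to the number of keys recovers ordinary dense attention.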
video diffusion attention