Multimodal
CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation
CineOrchestra is a unified video diffusion model for cinematic video generation that integrates multi-subject personalization, temporal control, multi-shot synthesis, and camera control in a single framework. It uses entity-centric conditioning primitives and two coordinated, parameter-free rotary embeddings: a temporal RoPE that keeps attention consistent across varying durations, and a 2D entity-temporal cross-attention RoPE that routes attention to specific entities. On new benchmarks, CineOrchestra outperforms six specialized models in dense caption following and shot-transition timing, which makes it relevant to practitioners who need finer control over video generation.
video generation cinematic
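The summary describes the rotary embeddings as parameter-free. As a rough illustration of what that means, here is a minimal NumPy sketch of the standard 1D rotary position embedding (RoPE), not CineOrchestra's temporal or 2D entity-temporal variants, whose details the summary does not give. The function names, the `base` value of 10000, and the interleaved pairing of feature dimensions are assumptions made for this example.

```python
import numpy as np


def rope_angles(positions, dim, base=10000.0):
    """Rotation angles for standard 1D RoPE, shape (len(positions), dim // 2).

    `base` is the conventional default; it is an assumption here, not a
    value taken from the summary above.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    # One fixed frequency per feature pair; nothing here is learned.
    inv_freq = base ** (-np.arange(0, dim, 2, dtype=np.float64) / dim)
    return np.outer(np.asarray(positions, dtype=np.float64), inv_freq)


def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x (seq, dim) by position-dependent angles."""
    x = np.asarray(x, dtype=np.float64)
    angles = rope_angles(positions, x.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


def rope_score(q, k, q_pos, k_pos):
    """Attention logit between one query and one key after applying RoPE."""
    q_rot = apply_rope(q[None, :], [q_pos])[0]
    k_rot = apply_rope(k[None, :], [k_pos])[0]
    return float(q_rot @ k_rot)
```

Because each feature pair is rotated by a fixed, position-dependent angle, the logit returned by `rope_score` depends only on the offset between the two positions, and the transform adds no trainable weights. Positions are treated as floats, so non-integer positions also work in this sketch; whether CineOrchestra indexes time that way is not stated in the summary.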