Multimodal
BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension
BioVid is a novel autoregressive video generation framework that learns the temporal structure of biological behaviors directly from training data, addressing the limitations of fixed-duration generation methods. It combines a Finite Scalar Quantization GAN (FSQ-R3GAN) for high-fidelity spatial reconstruction with a causal Transformer for autoregressive modeling, so the lengths of the generated clips follow the statistical properties of real behavioral data rather than a preset duration. In experiments on the NTU RGB+D dataset, BioVid achieved a Wasserstein-1 distance of 1.24 for the distribution of generated clip lengths, significantly outperforming baseline models. This result matters for practitioners who need generated video that is realistic and contextually accurate.
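As a rough illustration of the length-distribution metric mentioned above, the sketch below computes the Wasserstein-1 distance between two sets of clip lengths by integrating the absolute difference of their empirical CDFs. It is not the authors' evaluation code: the function name `wasserstein_1` and the sample lengths are assumptions made for this example, and the summary does not state the unit of length or how the evaluation was run.

```python
from bisect import bisect_right


def wasserstein_1(u, v):
    """Wasserstein-1 distance between two 1-D empirical samples.

    Computed as the integral of |F_u(x) - F_v(x)| over x, where F_u and F_v
    are the empirical CDFs. Hypothetical helper, not taken from the paper.
    """
    if not u or not v:
        raise ValueError("both samples must be non-empty")

    u_sorted = sorted(u)
    v_sorted = sorted(v)
    # The CDF difference is constant between consecutive distinct values.
    points = sorted(set(u_sorted) | set(v_sorted))

    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


# Made-up clip lengths (in frames), for illustration only.
real_lengths = [48, 52, 60, 75, 90]
generated_lengths = [50, 55, 58, 80, 88]
distance = wasserstein_1(real_lengths, generated_lengths)
```

A smaller value means the generated lengths are distributed more like the real ones. A value of 0 means the two empirical distributions coincide.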
video generation, biological behavior, autoregressive