NEST: Narrative Event Structures in Time for Long Video Understanding
The article introduces NEST, a dataset of 1005 full-length movies, each annotated with 102 multimodal narrative events that integrate visual content, dialogue, and audio. The benchmark establishes baselines for four tasks: event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The baseline results show how difficult the benchmark is, with ETD scores below 8% and EL scores under 6%. The dataset matters to practitioners because it targets narrative structure in long videos, where vision-language models must process complex temporal relationships across multimedia content.
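To make the four tasks concrete, here is a minimal sketch of how one annotated event might be represented and how a predicted time span could be compared with a gold span. Everything in it is an assumption for illustration: the class names, field names, role labels, and relation label are hypothetical and are not NEST's actual schema, and temporal IoU is a common localization measure rather than a metric the summary attributes to the benchmark.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class NarrativeEvent:
    """Hypothetical record for one annotated event (not the NEST schema)."""
    event_id: str
    trigger: str          # word or phrase that signals the event (ETD)
    start_s: float        # event start in seconds from movie start (EL)
    end_s: float          # event end in seconds from movie start (EL)
    arguments: Dict[str, str] = field(default_factory=dict)  # role -> filler (EAE)
    modalities: Tuple[str, ...] = ("visual", "dialogue", "audio")


@dataclass
class EventRelation:
    """Hypothetical link between two events (ERE)."""
    head_id: str
    tail_id: str
    label: str            # e.g. "before" or "causes"; labels are assumed


def temporal_iou(a_start: float, a_end: float,
                 b_start: float, b_end: float) -> float:
    """Intersection over union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


# Usage: compare a gold event span with a predicted span.
gold = NarrativeEvent(
    event_id="e1",
    trigger="confronts",
    start_s=3605.0,
    end_s=3650.0,
    arguments={"agent": "character_A", "target": "character_B"},
)
later = NarrativeEvent(event_id="e2", trigger="leaves",
                       start_s=3700.0, end_s=3720.0)
relation = EventRelation(head_id=gold.event_id, tail_id=later.event_id,
                         label="before")

predicted_span = (3620.0, 3660.0)
overlap = temporal_iou(gold.start_s, gold.end_s, *predicted_span)
print(f"{relation.head_id} {relation.label} {relation.tail_id}, IoU={overlap:.2f}")
```

Keeping the trigger, time span, arguments, and relations in separate fields mirrors the split between ETD, EL, EAE, and ERE, so each task can be scored independently on the same record.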