Multimodal · arXiv cs.CL · 7 d ago

Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures

The paper introduces MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates narrated videos explaining complex scientific figures. Each narration is grounded both in specific regions of the figure and in the paper the figure comes from. The authors also release FigTalk, a benchmark with new metrics for sequential and component-level grounding. On it, MINARD outperforms existing methods at producing human-like, paper-faithful narrations in both automatic and human evaluations. For practitioners, the work points to a way of using multimodal AI to make dense scientific figures easier to interpret and communicate.

video generation · scientific figures · narration
