Multimodal
VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA
VinQA is a new dataset for long-form answer generation in multimodal document question answering, where answers must integrate visual elements such as tables and images alongside text. The study evaluates two encoding methods, Page Encoding and Modality Encoding, each with its own citation mechanism for visual elements. Proprietary models perform best overall, but fine-tuning the open Qwen2.5-VL model on VinQA significantly improves its performance, particularly with Modality Encoding, which proves robust on complex documents. The work aims to advance multimodal large language models in real-world applications.
document, qa, visual elements