Research
FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs
FutureOmni is introduced as the first benchmark for evaluating omni-modal future forecasting in Multimodal Large Language Models (MLLMs), that is, predicting future events from audio-visual cues. It comprises 919 videos and 1,034 multiple-choice QA pairs across 8 domains. Evaluations show that existing models struggle, particularly in speech-heavy contexts: the highest accuracy reported is 64.8%, achieved by Gemini 3 Flash. The benchmark is accompanied by a 7K-sample instruction-tuning dataset and an Omni-Modal Future Forecasting (OFF) training strategy, which together improve future forecasting and generalization in MLLMs. All resources are publicly available for further research.
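Since the benchmark is scored as multiple-choice accuracy across domains, a minimal sketch of that kind of scoring may help make the reported number concrete. The record layout (`domain`, `answer`, `prediction` keys) and the function name are assumptions made for illustration; they are not taken from the FutureOmni release, whose actual data format and evaluation script may differ.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Compute overall and per-domain multiple-choice accuracy.

    `records` is an iterable of dicts with the hypothetical keys
    'domain', 'answer' (gold option letter) and 'prediction'
    (the model's chosen option letter).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        domain = record["domain"]
        total[domain] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[domain] += 1

    per_domain = {d: correct[d] / total[d] for d in total}
    n_total = sum(total.values())
    overall = sum(correct.values()) / n_total if n_total else 0.0
    return overall, per_domain


if __name__ == "__main__":
    # Toy records with placeholder domain names, not real benchmark data.
    toy = [
        {"domain": "domain_a", "answer": "A", "prediction": "a"},
        {"domain": "domain_a", "answer": "B", "prediction": "C"},
        {"domain": "domain_b", "answer": "D", "prediction": "D "},
    ]
    overall, per_domain = accuracy_by_domain(toy)
    print(f"overall={overall:.3f}", per_domain)
```

Reporting the per-domain breakdown alongside the overall figure is one way to surface the kind of weakness the summary describes, such as lower accuracy in speech-heavy contexts.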
multimodal, forecasting, llm