Multimodal · arXiv cs.AI · 4 d ago

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

MultiToP is a multimodal-context-aware visual token patching framework for reducing hallucinations in video large multimodal models. A lightweight Visual Token Patcher predicts token-level replacement distributions and refines unreliable visual tokens with a dynamic global patch token; training relies on information-guided rank calibration. Reported results include a 50.60% improvement in F1 score for Qwen3-VL-4B-Instruct and an 18.58% accuracy gain on ActivityNet-QA for Video-LLaVA-7B, which suggests the approach could make video understanding more reliable in practice.
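The summary does not spell out how the patcher is implemented, so the following is only a minimal sketch of the general idea of soft token patching, not the paper's method. It assumes a hypothetical linear scorer (`w`, `bias`) that turns each visual token and a pooled context vector into a replacement probability, and it assumes the global patch token is a reliability-weighted mean of the tokens. The function name `patch_visual_tokens` and all parameters are illustrative, and the rank-calibration training step is not covered.

```python
import math


def patch_visual_tokens(tokens, context, w, bias=0.0):
    """Softly replace unreliable visual tokens with one global patch token.

    tokens:  list of N token embeddings, each a list of D floats
    context: list of D floats, a pooled multimodal context vector (assumed)
    w, bias: hypothetical scorer parameters; a trained patcher would learn these
    """
    dim = len(context)

    # Context-aware replacement probability per token (sigmoid of a linear score).
    p_replace = []
    for tok in tokens:
        score = sum(w[d] * tok[d] * context[d] for d in range(dim)) + bias
        p_replace.append(1.0 / (1.0 + math.exp(-score)))

    # Global patch token: mean of the tokens, weighted by how reliable each one is.
    keep = [1.0 - p for p in p_replace]
    total = sum(keep) + 1e-8  # guard against division by zero
    global_patch = [
        sum(k * tok[d] for k, tok in zip(keep, tokens)) / total
        for d in range(dim)
    ]

    # Blend: reliable tokens stay close to themselves, unreliable ones move
    # toward the global patch token.
    patched = [
        [k * tok[d] + (1.0 - k) * global_patch[d] for d in range(dim)]
        for k, tok in zip(keep, tokens)
    ]
    return patched, p_replace, global_patch


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    context = [1.0, -1.0]
    w = [0.5, 0.5]
    patched, p_replace, global_patch = patch_visual_tokens(tokens, context, w)
    print(p_replace)
    print(patched)
```

In this sketch the blend is soft rather than a hard swap, which keeps the operation differentiable; whether MultiToP samples from its replacement distribution or blends in this way is not stated in the summary.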

multimodal · hallucinations · video models
