Inference · arXiv cs.CL · 2 d ago

Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective

The paper makes a theoretical and empirical case for greedy decoding in Visual Question Answering (VQA), challenging the prevalent use of stochastic sampling in Multimodal LLMs (MLLMs). It establishes the conditions under which greedy decoding is optimal and reports that greedy decoding outperforms stochastic methods across extensive benchmark tests. The work emphasizes that decoding strategies should be chosen per task, and suggests that practitioners treat greedy decoding as a robust default for VQA to improve model calibration and predictive accuracy.
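To make the contrast concrete, here is a minimal sketch of the two decoding rules applied to a single answer token. It is a toy illustration, not the paper's method or experimental setup: the candidate answers, the logits, the temperature, and the helper names (`softmax`, `greedy_pick`, `sample_pick`) are all assumptions made up for this example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always return the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature, rng):
    """Stochastic decoding: draw an index in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical scores for three candidate answers to "What color is the car?"
answers = ["red", "blue", "green"]
logits = [2.0, 1.0, 0.5]

rng = random.Random(0)  # fixed seed so the run is repeatable

greedy_answer = answers[greedy_pick(logits)]
sampled_answers = [answers[sample_pick(logits, 1.0, rng)] for _ in range(1000)]

# Fraction of sampled answers that agree with the greedy answer.
sample_hit_rate = sampled_answers.count(greedy_answer) / len(sampled_answers)

print("greedy answer:", greedy_answer)
print("sampling agrees with greedy in", sample_hit_rate, "of draws")
```

In this toy setting, greedy decoding returns the top-scored answer on every call, while sampling returns it only as often as the model's probability for that answer allows. If the top-scored answer is the correct one, sampling can only lose accuracy relative to greedy; whether that premise holds for a given model is the kind of condition the paper is said to analyze.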

llm · visual question answering · greedy
