Research · arXiv cs.CL · 16 d ago

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

IdealGPT is a framework for vision-and-language reasoning that iteratively decomposes a complex question into sub-questions and sub-answers, combining large language models (LLMs) with vision-language models (VLMs). This addresses a limitation of previous approaches: the decomposition repeats until the model is confident in its final answer. On zero-shot reasoning tasks, the paper reports that IdealGPT outperforms GPT-4-like models by 10% on VCR and 15% on SNLI-VE. For practitioners, the result suggests that an LLM-driven decomposition loop can improve multi-step inference, which may carry over to real-world applications that require nuanced reasoning.
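The iterative loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `iterative_decompose`, its three callables (`ask`, `answer`, `reason`), the `max_rounds` cap, and the use of `None` to signal "not yet confident" are all assumptions made for the example. The stubs stand in for real LLM and VLM calls.

```python
from typing import Callable, List, Optional, Tuple

QA = Tuple[str, str]  # one (sub-question, sub-answer) pair


def iterative_decompose(
    question: str,
    ask: Callable[[str, List[QA]], List[str]],       # hypothetical: LLM proposing sub-questions
    answer: Callable[[str], str],                     # hypothetical: VLM answering one sub-question
    reason: Callable[[str, List[QA]], Optional[str]], # hypothetical: LLM returning an answer or None
    max_rounds: int = 4,                              # assumed cap, not taken from the paper
) -> Tuple[Optional[str], List[QA]]:
    """Collect sub-answers round by round until the reasoner commits to an answer."""
    evidence: List[QA] = []
    for _ in range(max_rounds):
        # Ask for new sub-questions, given what has been gathered so far.
        for sub_q in ask(question, evidence):
            evidence.append((sub_q, answer(sub_q)))
        # The reasoner returns None while it is not confident.
        final = reason(question, evidence)
        if final is not None:
            return final, evidence
    # Still unsure after the last round: report no answer.
    return None, evidence


# Stub components so the sketch runs without any model.
def stub_ask(question: str, evidence: List[QA]) -> List[str]:
    return [f"sub-question {len(evidence) + 1}"]


def stub_answer(sub_q: str) -> str:
    return f"answer to {sub_q}"


def stub_reason(question: str, evidence: List[QA]) -> Optional[str]:
    # Pretend confidence is reached once two sub-answers are available.
    return "yes" if len(evidence) >= 2 else None


if __name__ == "__main__":
    print(iterative_decompose("Is the person cooking?", stub_ask, stub_answer, stub_reason))
```

With these stubs, the loop runs two rounds before the reasoner commits. Returning `None` when the round cap is hit is one possible design choice; a caller could instead force a best guess on the final round.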

vision-language · reasoning · llm
