Research
Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings
This article presents a method for mitigating hallucinations in Large Vision-Language Models (LVLMs) by refining textual embeddings so that they incorporate visual features. The approach promotes a more balanced distribution of attention between text and visual inputs, which improves multimodal reasoning and yields gains on hallucination benchmarks, including +9.33% on MMVP-MLLM and +3% on HallusionBench. For practitioners, this addresses a common failure mode in which LVLMs produce linguistically coherent but visually inaccurate outputs, and so improves the reliability of LVLM applications.
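To make the idea of a "balanced attention distribution" concrete, the sketch below computes how much of one query token's attention mass falls on visual tokens versus text tokens. It is an illustrative diagnostic only, not the article's method: the function names `softmax` and `modality_attention_share`, the toy logits, and the token layout (four visual tokens followed by four text tokens) are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modality_attention_share(attn_weights, is_visual):
    """Split one query's attention mass into (visual_share, text_share).

    attn_weights: non-negative attention weights over the input sequence.
    is_visual:    booleans of the same length, True where the token is visual.
    Both names are hypothetical; they do not come from the article.
    """
    if len(attn_weights) != len(is_visual):
        raise ValueError("attn_weights and is_visual must have the same length")
    total = sum(attn_weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    visual = sum(w for w, v in zip(attn_weights, is_visual) if v)
    return visual / total, (total - visual) / total


# Toy example (assumed values): 4 visual tokens followed by 4 text tokens,
# with logits that favor the text tokens.
logits = [0.1, 0.2, 0.0, 0.1, 2.0, 1.5, 1.8, 2.2]
is_visual = [True] * 4 + [False] * 4

visual_share, text_share = modality_attention_share(softmax(logits), is_visual)
print(f"visual share: {visual_share:.3f}, text share: {text_share:.3f}")
```

A visual share far below the text share, as in this toy input, is the kind of imbalance the summary describes; in a real model the weights would come from the attention maps of a chosen layer and head rather than from hand-written logits.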
hallucinations, vision-language-models, embedding