Research
Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
The paper presents Cross-modal Identity Mapping (CIM), a reinforcement learning framework that minimizes information loss when Large Vision-Language Models (LVLMs) convert images into captions. CIM measures information loss with two signals: Gallery Representation Consistency and Query-gallery Image Relevance. With the Qwen2.5-VL-7B model, it delivers a 20% improvement in relation reasoning on the COCO-LN500 benchmark and outperforms traditional Supervised Fine-Tuning approaches. For practitioners, the result is more precise LVLM-generated image captions without additional annotations.
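To make the two signals concrete, here is a minimal Python sketch of how a retrieval-based caption reward could be computed. It is an illustration under stated assumptions, not the paper's implementation: the summary does not give the formulas, so the sketch assumes that the caption is used to retrieve the top-k images from a gallery, that Gallery Representation Consistency is the mean pairwise cosine similarity among the retrieved images, and that Query-gallery Image Relevance is the mean cosine similarity between the source image and the retrieved images. The function names, the weight `alpha`, and the toy embeddings are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(caption_emb, gallery, k):
    """Return the k gallery embeddings most similar to the caption embedding."""
    ranked = sorted(gallery, key=lambda g: cosine(caption_emb, g), reverse=True)
    return ranked[:k]


def gallery_consistency(retrieved):
    """ASSUMED definition: mean pairwise cosine similarity of retrieved images."""
    pairs = list(combinations(retrieved, 2))
    if not pairs:
        return 1.0  # a single retrieved image is trivially consistent
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def query_gallery_relevance(image_emb, retrieved):
    """ASSUMED definition: mean similarity of source image to retrieved images."""
    return sum(cosine(image_emb, g) for g in retrieved) / len(retrieved)


def caption_reward(image_emb, caption_emb, gallery, k=2, alpha=0.5):
    """Hypothetical scalar reward: weighted sum of the two signals."""
    retrieved = top_k(caption_emb, gallery, k)
    consistency = gallery_consistency(retrieved)
    relevance = query_gallery_relevance(image_emb, retrieved)
    return alpha * consistency + (1 - alpha) * relevance


# Toy 2-D embeddings: two images near the source image, two unrelated ones.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
source_image = [1.0, 0.0]
precise_caption = [1.0, 0.0]  # caption that preserves the image content
vague_caption = [1.0, 1.0]    # caption that fits many unrelated images

precise_reward = caption_reward(source_image, precise_caption, gallery)
vague_reward = caption_reward(source_image, vague_caption, gallery)
print(precise_reward > vague_reward)
```

In this toy setup the precise caption retrieves a tight cluster of images close to the source, so both signals are high, while the vague caption retrieves a scattered set and earns a lower reward. A reward of this kind needs only images and an embedding model, which is consistent with the summary's point that no additional annotations are required.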
image captioning, reinforcement learning, information loss