Research · arXiv cs.AI — 10 d ago

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

The paper presents Cross-modal Identity Mapping (CIM), a reinforcement learning framework that aims to minimize the information lost when Large Vision-Language Models (LVLMs) convert an image into a caption. CIM estimates that loss with two signals, Gallery Representation Consistency and Query-gallery Image Relevance. With the Qwen2.5-VL-7B model, it reports a 20% improvement in relation reasoning on the COCO-LN500 benchmark and outperforms traditional Supervised Fine-Tuning approaches. For practitioners, the main appeal is more precise LVLM captions without additional annotations.
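The summary names the two signals but does not define them, so the following is only an illustrative sketch of how such a reward might be computed, not the paper's formulation. It assumes the generated caption has already been used to retrieve a small gallery of images, and that the query image and gallery images are available as nonzero embedding vectors. The function names, the cosine-similarity definitions, and the `alpha` weighting are all hypothetical.

```python
import numpy as np


def _normalize(x):
    """L2-normalize vectors along the last axis (assumes nonzero vectors)."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def gallery_consistency(gallery):
    """Hypothetical proxy: mean pairwise cosine similarity among gallery images.

    A caption that preserves more detail should retrieve images that
    resemble one another, giving a value closer to 1.
    """
    g = _normalize(gallery)
    k = len(g)
    if k < 2:
        return 1.0
    sims = g @ g.T
    off_diagonal = sims[~np.eye(k, dtype=bool)]  # drop self-similarities
    return float(off_diagonal.mean())


def query_gallery_relevance(query, gallery):
    """Hypothetical proxy: mean cosine similarity between query and gallery."""
    q = _normalize(query)
    g = _normalize(gallery)
    return float((g @ q).mean())


def cim_style_reward(query, gallery, alpha=0.5):
    """Assumed scalar reward: a weighted sum of the two proxies."""
    consistency = gallery_consistency(gallery)
    relevance = query_gallery_relevance(query, gallery)
    return alpha * consistency + (1.0 - alpha) * relevance
```

Under these assumptions, a caption whose retrieved gallery is tightly clustered around the original image earns a reward near 1, while a vague caption that retrieves scattered, unrelated images earns a lower one. A reinforcement learning loop could use such a scalar as the per-caption reward, which would be in line with the summary's point that no additional annotations are required.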

image captioning · reinforcement learning · information loss
