ai-digest.dev
Multimodal · arXiv cs.AI · 7 d ago

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

The paper introduces MACCO (MAsked Compositional Concept MOdeling), a framework for improving compositional understanding in vision-language models (VLMs) such as CLIP. MACCO masks compositional concepts in one modality and reconstructs them from contextual information in the other, which pushes the model to capture cross-modal compositional structure. Auxiliary objectives align the masked features. The authors report improved compositionality across five benchmarks, with gains that carry over to text-to-image generation and multimodal large language models.
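
To make the mask-and-reconstruct idea concrete, here is a toy sketch in plain Python. It is not the paper's implementation: the function name, the mean-pooled query, the single attention step over image patches, and the squared-error loss are all assumptions chosen for illustration, and the summary above does not say how MACCO actually selects concepts or computes its losses.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_reconstruction_loss(text_tokens, image_patches, concept_mask):
    """Hypothetical loss: rebuild masked text-token features from image patches.

    text_tokens:   list of T feature vectors (one per text token)
    image_patches: list of P feature vectors (one per image patch)
    concept_mask:  list of T booleans, True where a concept token is masked
    Assumes at least one masked and one visible token.
    """
    dim = len(text_tokens[0])
    visible = [t for t, m in zip(text_tokens, concept_mask) if not m]
    masked = [t for t, m in zip(text_tokens, concept_mask) if m]

    # Assumption: the visible text context is summarised by mean pooling.
    query = [sum(t[d] for t in visible) / len(visible) for d in range(dim)]

    # Attend from the text context to the other modality (image patches).
    weights = softmax([dot(query, p) / math.sqrt(dim) for p in image_patches])
    recon = [sum(w * p[d] for w, p in zip(weights, image_patches))
             for d in range(dim)]

    # Mean squared error between the reconstruction and each masked token.
    errors = [(t[d] - recon[d]) ** 2 for t in masked for d in range(dim)]
    return sum(errors) / len(errors)

# Toy inputs: the masked token [0, 1] matches every image patch exactly.
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_patches = [[0.0, 1.0], [0.0, 1.0]]
concept_mask = [False, True, False]
loss = cross_modal_reconstruction_loss(text_tokens, image_patches, concept_mask)
```

In this toy setup the loss falls toward zero when the image features carry the information that was masked out of the text, which is the intuition behind reconstructing one modality from the other.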

vision-language · compositionality · clustering
relevance 0.00 · engagement 0.00