Multimodal
Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality
The article introduces MACCO (MAsked Compositional Concept MOdeling), a framework for improving compositional understanding in vision-language models (VLMs) such as CLIP. MACCO masks compositional concepts in one modality and reconstructs them from contextual information in the other, which pushes the model to capture cross-modal compositional structure. The framework also includes auxiliary objectives for aligning the masked features. The article reports significant improvements in compositionality across five benchmarks, and these gains also benefit text-to-image generation and multimodal large language models.
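The mask-and-reconstruct idea can be illustrated with a toy sketch. This is not MACCO's actual implementation: every function name here is hypothetical, and the attention-pooled reconstruction, the mean-of-context query, and the cosine loss are assumptions chosen only to make the idea concrete in plain Python.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mask_concept(token_feats, concept_idx, mask_vec):
    """Replace the features at the concept positions with a mask vector."""
    return [mask_vec if i in concept_idx else f
            for i, f in enumerate(token_feats)]

def reconstruct_from_other_modality(query, context_feats):
    """Assumed reconstruction: attention-weighted average of the other modality."""
    weights = softmax([dot(query, c) for c in context_feats])
    dim = len(context_feats[0])
    return [sum(w * c[d] for w, c in zip(weights, context_feats))
            for d in range(dim)]

def reconstruction_loss(original, reconstructed):
    """Assumed objective: 1 - cosine similarity, lower is better."""
    return 1.0 - cosine(original, reconstructed)

def masked_concept_step(text_feats, image_feats, concept_idx, mask_vec):
    """Mask one text concept and reconstruct it from image features."""
    masked = mask_concept(text_feats, concept_idx, mask_vec)
    # Query: mean of the unmasked text tokens (an assumption for this toy).
    kept = [f for i, f in enumerate(masked) if i not in concept_idx]
    dim = len(kept[0])
    query = [sum(f[d] for f in kept) / len(kept) for d in range(dim)]
    recon = reconstruct_from_other_modality(query, image_feats)
    # Average the loss over all masked positions.
    losses = [reconstruction_loss(text_feats[i], recon) for i in concept_idx]
    return sum(losses) / len(losses)

# Toy 2-d features standing in for text tokens and image patches.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_feats = [[0.9, 0.1], [0.1, 0.9]]
loss = masked_concept_step(text_feats, image_feats, {1}, [0.0, 0.0])
```

The same step could be run in the opposite direction, masking image features and reconstructing them from text, to mirror the cross-modal setup described above. In a real model the features would come from the VLM's encoders and the reconstruction module would be learned.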
vision-language, compositionality, clustering