Multimodal
Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model
The paper introduces a plug-and-adapt method for Multimodal Coreference Resolution (MCR) that reuses a pretrained alignment model instead of training extensively on task-specific datasets. A fine-grained alignment model is first pre-trained on vision-language datasets and then adapted to MCR through similarity aggregation. On the Coreference Image Narratives (CIN) benchmark, the approach improves CoNLL F1 by 5.31% over state-of-the-art methods and by 2.12% over popular Vision-Language Large Models (VLLMs). For practitioners, this reduces the dependency on resource-intensive models and annotated data, which eases deployment in real-world applications.
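The summary does not spell out how similarity aggregation works, so the following is only a minimal sketch of one plausible reading, not the paper's implementation. It assumes an alignment model has already produced embedding vectors for text mentions and image regions. Each mention is turned into a distribution over regions, two mentions are scored by how likely they are to point at the same region, and pairs above a threshold are merged into coreference clusters. The function names, the `threshold` and `temperature` parameters, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature):
    """Numerically stable softmax with a temperature."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cluster_mentions(mention_vecs, region_vecs, threshold=0.5, temperature=0.1):
    """Group mentions whose region distributions overlap (illustrative only).

    Assumption: mention_vecs and region_vecs come from some pretrained
    alignment model; here they are plain lists of floats.
    """
    # Step 1: mention-to-region similarities, normalised per mention.
    dists = [
        softmax([cosine(m, r) for r in region_vecs], temperature)
        for m in mention_vecs
    ]

    # Step 2: union-find over mentions, merging pairs with high affinity.
    n = len(dists)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Aggregated affinity: chance both mentions pick the same region.
            affinity = sum(p * q for p, q in zip(dists[i], dists[j]))
            if affinity >= threshold:
                parent[find(i)] = find(j)

    # Step 3: collect clusters as sorted lists of mention indices.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy example: mentions 0 and 1 align with region 0, mention 2 with region 1.
mentions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0]]
print(cluster_mentions(mentions, regions))  # [[0, 1], [2]]
```

The threshold-and-merge step is a deliberately simple stand-in for clustering. A real system would tune the threshold on held-out data, or replace this step with whatever aggregation the paper actually defines.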
coreference-resolution, alignment