Research
From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning
The article introduces BridgeVLM, a vision-language model that improves visual causal reasoning by internalizing causal mechanisms through structured Causal Tokens and RAMP layers that perform causal message passing. A unified training interface, M3S, provides fine-grained causal supervision. On intervention tasks, accuracy on CausalVLBench reaches 54.4%, compared with 33.2% under traditional prompt-level supervision, a gain of 21.2 percentage points. Accuracy on Causal3D rises from 43.6% to 49.0%, a gain of 5.4 points. For practitioners, the approach offers a more reliable framework for multi-image causal reasoning, including intervention and counterfactual analysis.
causal reasoning, vision-language, ml