Research
Attention, not scale, drives human-AI alignment in multimodal language prediction
The study compares five state-of-the-art pretrained vision-language models with human participants on the task of predicting upcoming words from visual context. Model size had no significant effect on alignment with human predictability ratings. Transformer attention mechanisms, by contrast, improved this alignment, particularly when visual cues were informative, and accounted for up to 70% of the variance in human gaze patterns. The results underscore the importance of selective attention for human-like language prediction in multimodal models and suggest that practitioners should prioritize optimizing attention mechanisms over simply increasing model size.
human-ai alignment, multimodal, language prediction
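
One common way to read a "variance explained" figure like the one in the summary is as an R² between a model's attention weights over image regions and the share of human gaze on those regions. The summary does not describe how the study computed it, so the sketch below is only an illustration under that assumption. The function name `variance_explained` and the toy numbers are hypothetical and are not taken from the study.

```python
import numpy as np


def variance_explained(attention, gaze):
    """R^2 of a least-squares line predicting gaze from attention.

    Assumption: both inputs hold one value per image region, e.g. a model's
    attention weight and the share of human fixations on that region.
    With a single predictor and an intercept, R^2 equals the squared
    Pearson correlation, which is what is computed here.
    """
    a = np.asarray(attention, dtype=float).ravel()
    g = np.asarray(gaze, dtype=float).ravel()
    if a.shape != g.shape:
        raise ValueError("attention and gaze must have the same number of regions")
    # A constant input carries no information, so report zero variance explained.
    if a.std() == 0 or g.std() == 0:
        return 0.0
    r = np.corrcoef(a, g)[0, 1]
    return float(r ** 2)


# Hypothetical toy values for four image regions (not data from the study).
toy_attention = [0.10, 0.20, 0.30, 0.40]
toy_gaze = [0.05, 0.30, 0.25, 0.40]
print(f"R^2 = {variance_explained(toy_attention, toy_gaze):.2f}")
```

A value near 1 would mean that regions the model attends to are also the regions people look at. A value near 0 would mean attention carries little linear information about gaze.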