Multimodal · arXiv cs.AI · 2 d ago

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

The study traces how audio and visual information flows through Audio-Visual Large Language Models (AVLLMs), examining Qwen2.5-Omni and Video-SALMONN2 Plus at the 3B and 7B parameter scales. It finds that AVLLMs process audio-visual video inputs through a sequential information flow but switch to parallel streams for interleaved items. It also shows that certain token types can be discarded after integration with minimal impact on predictions, which improves inference efficiency. These findings advance the understanding of multimodal interaction in LLMs and offer a foundation for more interpretable model designs.
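To make the token-discarding idea concrete, here is a minimal sketch of what dropping one token type after an integration point could look like. It is not the paper's procedure: the function names `prune_tokens` and `attention_pairs`, the type labels, and the choice of "audio" as the discarded type are all hypothetical, since the summary does not say which token types can be removed or at which layer.

```python
from typing import List, Sequence, Set, Tuple


def prune_tokens(
    token_types: Sequence[str],
    hidden_states: Sequence[Sequence[float]],
    drop_types: Set[str],
) -> Tuple[List[str], List[List[float]]]:
    """Keep only the positions whose token type is not in drop_types.

    Hypothetical helper: it stands in for removing tokens from the
    sequence once their information has been integrated elsewhere.
    """
    if len(token_types) != len(hidden_states):
        raise ValueError("token_types and hidden_states must have equal length")
    kept = [i for i, t in enumerate(token_types) if t not in drop_types]
    return (
        [token_types[i] for i in kept],
        [list(hidden_states[i]) for i in kept],
    )


def attention_pairs(num_tokens: int) -> int:
    """Rough proxy for full self-attention cost: one score per token pair."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    # Toy sequence; the labels are illustrative, not taken from the paper.
    types = ["text", "audio", "video", "audio", "text"]
    states = [[0.0], [1.0], [2.0], [3.0], [4.0]]

    # Assumption for illustration only: "audio" tokens are the droppable type.
    kept_types, kept_states = prune_tokens(types, states, {"audio"})

    print(kept_types)
    print(attention_pairs(len(types)), "->", attention_pairs(len(kept_types)))
```

Because full self-attention scales with the square of the sequence length, every later layer does less work once the sequence is shorter, which is the kind of efficiency gain the summary refers to.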

audio-visual · information flow · llm
