Multimodal
From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
The study traces the information flow of audio and visual signals in Audio-Visual Large Language Models (AVLLMs), examining Qwen2.5-Omni and Video-SALMONN2 Plus at the 3B and 7B parameter scales. It finds that AVLLMs process audio-visual video inputs through a sequential information flow but switch to parallel streams for interleaved items. It also shows that certain token types can be discarded after integration with minimal impact on predictions, which improves inference efficiency. These findings advance the understanding of multimodal interactions in LLMs and offer a foundation for improved interpretability and design in future models.
audio-visual, information flow, llm