Multimodal
Spectro-Temporal Interference Confounds Phase Encoding in Spatial Audio Foundation Models
The article presents a psychoacoustic benchmark, built on the binaural masking level difference (BMLD), for evaluating whether spatial audio models encode phase information. It probes nine frozen audio models, spanning binaural and monaural self-supervised learning (SSL) models, and finds that dedicated binaural SSL models genuinely encode phase, whereas general-purpose models instead rely on spectro-temporal interference cues. The result underscores that accurate phase encoding matters for spatial audio applications, particularly for practitioners developing sound localization systems.
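For readers unfamiliar with the BMLD paradigm, here is a minimal sketch of its two classic stimulus conditions and the resulting metric. In N0S0, a noise masker and a tone are identical in both ears. In N0Sπ, the noise is still identical, but the tone is phase-inverted in one ear. The BMLD is the detection threshold in N0S0 minus the threshold in N0Sπ, in dB. The function names, sample rate, tone frequency, SNR, and the example thresholds are illustrative assumptions, not the benchmark's actual parameters or results.

```python
import math
import random


def make_stimulus(condition, fs=16000, dur=0.1, tone_hz=500.0, snr_db=-10.0, seed=0):
    """Hypothetical helper: return (left, right) samples for a BMLD condition.

    condition: "N0S0" (tone identical in both ears) or
               "N0Spi" (tone inverted, i.e. shifted by pi, in the right ear).
    """
    if condition == "N0S0":
        sign = 1.0
    elif condition == "N0Spi":
        sign = -1.0
    else:
        raise ValueError(f"unknown condition: {condition}")

    rng = random.Random(seed)
    n = round(fs * dur)
    # N0: the same unit-variance Gaussian noise is presented to both ears.
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Tone power is A^2 / 2; choose A so tone power / noise power matches snr_db.
    amp = math.sqrt(2.0 * 10 ** (snr_db / 10.0))
    tone = [amp * math.sin(2 * math.pi * tone_hz * t / fs) for t in range(n)]

    left = [x + s for x, s in zip(noise, tone)]
    right = [x + sign * s for x, s in zip(noise, tone)]
    return left, right


def bmld_db(threshold_n0s0_db, threshold_n0spi_db):
    """BMLD in dB: how much lower the N0Spi detection threshold is than N0S0."""
    return threshold_n0s0_db - threshold_n0spi_db


l0, r0 = make_stimulus("N0S0")
lpi, rpi = make_stimulus("N0Spi")
print(l0 == lpi)             # True: the left ear is identical in both conditions
print(bmld_db(-10.0, -22.0))  # 12.0 for these made-up thresholds
```

Because only the interaural relation of the tone differs between conditions, a model that truly encodes phase across the two ears has a principled basis for separating them; the thresholds passed to `bmld_db` here are placeholders, not measured values.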
audio, spatial-models, self-supervised