Models
MuVAP: Multimodal Multiparty Voice Activity Projection for Turn-taking Prediction in the Wild
MuVAP is a new causal multimodal framework for turn-taking prediction in multiparty interactions that uses only a single audio stream and a single camera view. It introduces Role-Relative Projection to simplify the modeling of interactions among multiple speakers, and it is validated on the Audio-Visual Conversation Corpus, a novel 31-hour dataset of unedited conversations. MuVAP outperforms existing baselines on Shift-Hold and next-speaker prediction tasks, which makes it relevant to human-robot interaction.
turn-taking, multimodal, audio-visual, MuVAP