Agents
Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation
The paper presents a two-stage, audio-only pipeline for multiparty turn-taking in spoken dialogue systems, targeting overlapping speech and rapid speaker transitions. A fast trigger proposes end-of-turn times, and a lightweight verifier then determines speaker shifts; the authors report improved shift detection on the VoxConverse dataset. Diffusion-based background audio mixing, used as a data augmentation technique, further improves performance, which makes the approach relevant to practitioners building more robust multiparty interaction systems.
turn-taking, dialogue systems, audio, diffusion
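
The trigger-then-verifier control flow described in the summary can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `propose_end_of_turn`, `run_pipeline`, and `TurnDecision`, the per-frame score input, the rising-edge rule, and the `frame_s` and `threshold` values are all hypothetical, and the verifier is passed in as an arbitrary callable because the summary does not say how it works.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class TurnDecision:
    time_s: float   # candidate end-of-turn time proposed by the trigger
    is_shift: bool  # verifier's judgment: does the speaker change here?


def propose_end_of_turn(
    trigger_scores: Sequence[float], frame_s: float, threshold: float
) -> List[float]:
    """Stage 1 (fast "when"): emit a candidate time at each rising edge,
    i.e. each frame where the score crosses the threshold from below.
    The rising-edge rule is an assumption made for this sketch."""
    times = []
    prev_above = False
    for i, score in enumerate(trigger_scores):
        above = score >= threshold
        if above and not prev_above:
            times.append(i * frame_s)
        prev_above = above
    return times


def run_pipeline(
    trigger_scores: Sequence[float],
    verify_shift: Callable[[float], bool],
    frame_s: float = 0.02,   # hypothetical frame hop in seconds
    threshold: float = 0.5,  # hypothetical trigger threshold
) -> List[TurnDecision]:
    """Stage 2 (careful "who"): call the verifier only at the candidate
    times, so the more expensive check runs on a few frames, not all."""
    candidates = propose_end_of_turn(trigger_scores, frame_s, threshold)
    return [TurnDecision(t, verify_shift(t)) for t in candidates]


if __name__ == "__main__":
    scores = [0.1, 0.6, 0.7, 0.2, 0.9]
    # Stand-in verifier for demonstration only.
    for decision in run_pipeline(scores, lambda t: t > 1.0, frame_s=0.5):
        print(decision)
```

The point of the split is that the cheap trigger runs on every frame while the verifier is invoked only at the proposed times.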