Multimodal
Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation
The article introduces UniSinger, an end-to-end framework that unifies song generation and singing voice conversion (SVC) with accompaniment co-generation. It targets two gaps: song generation systems lack zero-shot speaker cloning, and SVC systems lack vocal-accompaniment synergy. Built on a multimodal diffusion transformer, UniSinger learns a unified speaker embedding space for transferring speaker representations, and it trains with a curriculum learning strategy that uses task-specific modality masking to optimize multi-task performance. For practitioners, the result is finer control over vocal timbre and accompaniment, which may support more sophisticated music production workflows.
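To make the idea of a curriculum with task-specific modality masking more concrete, here is a minimal Python sketch. It is not UniSinger's implementation: the modality names, the mask table, the stage boundaries, and the helper functions `task_probs`, `sample_task`, and `apply_mask` are all hypothetical, chosen only to illustrate how one model could see different conditioning inputs per task while the task mix changes over training.

```python
import random

# Assumed task-specific masks: 1 keeps a conditioning modality, 0 masks it out.
# The modality names are hypothetical; the summary does not list the real inputs.
TASK_MASKS = {
    "song_generation": {"lyrics": 1, "source_vocal": 0, "speaker_embedding": 1},
    "svc": {"lyrics": 0, "source_vocal": 1, "speaker_embedding": 1},
}

# Assumed curriculum: (exclusive end step of the stage, task sampling probabilities).
# The last stage has no end and mixes both tasks.
CURRICULUM = [
    (1000, {"song_generation": 1.0, "svc": 0.0}),
    (2000, {"song_generation": 0.0, "svc": 1.0}),
    (None, {"song_generation": 0.5, "svc": 0.5}),
]


def task_probs(step):
    """Return the task sampling probabilities for the stage containing `step`."""
    for end, probs in CURRICULUM:
        if end is None or step < end:
            return probs
    raise ValueError("curriculum has no open-ended final stage")


def sample_task(step, rng):
    """Sample a task name according to the current curriculum stage."""
    probs = task_probs(step)
    tasks = list(probs)
    return rng.choices(tasks, weights=[probs[t] for t in tasks], k=1)[0]


def apply_mask(features, task):
    """Zero out every modality that the given task does not condition on."""
    mask = TASK_MASKS[task]
    return {
        name: (list(vec) if mask[name] else [0.0] * len(vec))
        for name, vec in features.items()
    }


if __name__ == "__main__":
    rng = random.Random(0)
    batch = {
        "lyrics": [0.2, 0.7],
        "source_vocal": [0.5, 0.1],
        "speaker_embedding": [0.9, 0.3],
    }
    for step in (0, 1500, 5000):
        task = sample_task(step, rng)
        print(step, task, apply_mask(batch, task))
```

In this sketch the speaker embedding is kept for both tasks, mirroring the summary's point that a shared speaker representation is used across song generation and SVC; everything else about the schedule is an illustrative assumption.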
song-generation, voice-conversion, accompaniment