Multimodal · arXiv cs.AI · 11 d ago

Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

The paper introduces UniSinger, an end-to-end framework that unifies song generation and singing voice conversion (SVC) with accompaniment co-generation. It targets two gaps: song generation systems lack zero-shot speaker cloning, and SVC systems do not model the synergy between vocals and accompaniment. Built on a multimodal diffusion transformer, UniSinger creates a unified speaker embedding space for transferring speaker representations, and it trains with a curriculum learning strategy that uses task-specific modality masking to optimize multi-task performance. For practitioners, the result is finer control over vocal timbre and accompaniment in music production.
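To make "task-specific modality masking" and "curriculum learning" more concrete, here is a minimal sketch of how such a scheme could be wired up. It is not UniSinger's implementation: the modality names, the per-task visibility table, and the warm-up schedule are all hypothetical, since the summary does not specify them.

```python
import random

# Hypothetical conditioning modalities; the summary does not list the real ones.
MODALITIES = ("lyrics", "source_vocal", "speaker_reference")

# Assumed visibility table: which modalities each task is allowed to condition on.
TASK_VISIBLE = {
    "song_generation": {"lyrics", "speaker_reference"},
    "svc": {"source_vocal", "speaker_reference"},
}


def modality_mask(task):
    """Return {modality: 1 or 0}; 1 means the modality is visible for this task."""
    visible = TASK_VISIBLE[task]
    return {m: int(m in visible) for m in MODALITIES}


def curriculum_task(step, warmup_steps, rng):
    """Assumed curriculum: one task during warm-up, then a uniform mix of tasks."""
    if step < warmup_steps:
        return "song_generation"
    return rng.choice(sorted(TASK_VISIBLE))


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 5, 10, 15):
        task = curriculum_task(step, warmup_steps=10, rng=rng)
        print(step, task, modality_mask(task))
```

In a real training loop, the mask would typically zero out or replace the embeddings of hidden modalities before they reach the shared transformer, so one set of weights can serve both tasks.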

song-generation · voice-conversion · accompaniment
