Coding · arXiv cs.AI · 16 d ago

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

The paper introduces ZeSTA, a domain-conditioned training framework for zero-shot text-to-speech (ZS-TTS) augmentation, aimed at personalized speech synthesis in low-resource settings. ZeSTA combines a lightweight domain embedding with real-data oversampling to mitigate the loss of speaker similarity during fine-tuning, without modifying the base architecture. Experiments on LibriTTS and an in-house dataset indicate that the approach improves speaker similarity while preserving intelligibility and perceptual quality, which makes it relevant to practitioners building personalized TTS systems.
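As a rough illustration of the two ingredients named in the summary, the sketch below tags each training utterance with a domain ID and repeats the real recordings so they are not swamped by augmented ones. It is not the authors' implementation: the assumption that the augmented data is ZS-TTS-generated speech, the names `Utterance`, `build_training_list`, and `real_repeat`, and the integer repeat factor are all hypothetical. In a real model the domain ID would typically index a small learned embedding, which is not shown here.

```python
import random
from dataclasses import dataclass

# Hypothetical domain IDs; a model could map these to a small learned embedding.
REAL, SYNTHETIC = 0, 1


@dataclass(frozen=True)
class Utterance:
    path: str    # audio file path (illustrative only, never opened here)
    text: str    # transcript
    domain: int  # REAL or SYNTHETIC


def build_training_list(real, synthetic, real_repeat=4, seed=0):
    """Repeat each real utterance `real_repeat` times, add the synthetic ones, shuffle."""
    if real_repeat < 1:
        raise ValueError("real_repeat must be >= 1")
    mixed = list(real) * real_repeat + list(synthetic)
    random.Random(seed).shuffle(mixed)  # seeded so the order is reproducible
    return mixed


# Toy data: 2 real recordings and 6 synthetic ones (assumed to come from a ZS-TTS model).
real = [Utterance(f"real_{i}.wav", f"real text {i}", REAL) for i in range(2)]
synthetic = [Utterance(f"syn_{i}.wav", f"synthetic text {i}", SYNTHETIC) for i in range(6)]

train = build_training_list(real, synthetic, real_repeat=3)
real_share = sum(u.domain == REAL for u in train) / len(train)
```

With these toy numbers, real utterances make up 6 of the 12 training entries, compared with 2 of 8 without oversampling. The repeat factor is a free choice in this sketch and says nothing about the ratio used in the paper.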

tts · data-augmentation · speech-synthesis · relevance 0.00 · engagement 0.00
