Coding
ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis
The article introduces ZeSTA, a domain-conditioned training framework for zero-shot text-to-speech (ZS-TTS) that targets personalized speech synthesis in low-resource settings. Using a lightweight domain embedding together with real-data oversampling, ZeSTA mitigates the speaker-similarity degradation that occurs during fine-tuning, without altering the base architecture. Experiments on LibriTTS and an in-house dataset indicate that the approach improves speaker similarity while maintaining intelligibility and perceptual quality, which makes it relevant to practitioners building personalized TTS systems.
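To make the two ingredients named in the summary more concrete, here is a minimal sketch of what real-data oversampling and a domain embedding might look like in a data pipeline. It is not the authors' implementation. The summary does not say what the domains are, how the embedding is injected, or what oversampling ratio is used, so the real/synthetic split, the names `Utterance`, `build_training_pool`, and `add_domain_embedding`, the additive injection, and the factor of 4 are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass

# Hypothetical domain IDs: the summary does not name the domains.
# Here we assume "real recordings" vs "ZS-TTS-generated audio".
REAL, SYNTHETIC = 0, 1


@dataclass(frozen=True)
class Utterance:
    path: str    # audio file path (illustrative)
    text: str    # transcript
    domain: int  # REAL or SYNTHETIC


def build_training_pool(real, synthetic, real_oversample=4):
    """Repeat every real utterance `real_oversample` times, then append
    the synthetic utterances once. The factor is an assumed value."""
    if real_oversample < 1:
        raise ValueError("real_oversample must be >= 1")
    return list(real) * real_oversample + list(synthetic)


def add_domain_embedding(conditioning, domain, table):
    """Add the domain's vector to a conditioning vector element-wise.
    Additive injection is an assumption, not taken from the source."""
    vec = table[domain]
    if len(vec) != len(conditioning):
        raise ValueError("embedding and conditioning sizes differ")
    return [c + d for c, d in zip(conditioning, vec)]


# Toy data: 2 real utterances and 8 synthetic ones.
real = [Utterance(f"real_{i}.wav", "hello", REAL) for i in range(2)]
synthetic = [Utterance(f"syn_{i}.wav", "hello", SYNTHETIC) for i in range(8)]

pool = build_training_pool(real, synthetic, real_oversample=4)
random.Random(0).shuffle(pool)  # fixed seed so the order is reproducible

# In a real model this table would hold learned parameters.
domain_table = {REAL: [0.0, 0.0], SYNTHETIC: [0.5, -0.5]}
conditioned = add_domain_embedding([1.0, 2.0], SYNTHETIC, domain_table)
```

With these toy numbers the pool holds equal counts of real and synthetic items, so a scarce real set is not swamped by generated audio, while the domain tag gives the model a way to tell the two sources apart.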
tts, data-augmentation, speech-synthesis