Research · arXiv cs.AI · 9 d ago

An Empirical Study on Learning Latent Representations for Emotional Speech Synthesis

The paper presents an empirical study of emotional speech synthesis (ESS) built on an extended FastSpeech 2 architecture that adds speaker embeddings and a prosody bottleneck. The proposed system generates emotional speech for a single speaker. It can also transfer emotional speaking styles to speakers for whom only neutral data is available, while preserving their identity. For practitioners, the work addresses a persistent weakness of text-to-speech systems, limited expressiveness, with the aim of making synthesized speech more natural and varied.
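The summary names two additions to FastSpeech 2 but does not say how they are wired in. The following is a hypothetical, pure-Python sketch, not the authors' implementation. It assumes that a speaker embedding and a low-dimensional prosody code are both added to every encoder frame. All names (`ConditionedEncoderSketch`, `prosody_code`) and all dimensions are invented for illustration. The narrow `d_bottleneck` projection is what makes it a bottleneck: only a few numbers of prosody information pass through. That restriction is one plausible way to discourage the prosody path from also carrying speaker identity.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def init_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights (hypothetical initialisation; no training here)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class ConditionedEncoderSketch:
    """Hypothetical conditioning block: adds a speaker embedding and a
    bottlenecked prosody code to every encoder frame."""

    def __init__(self, n_speakers: int, d_model: int, d_prosody: int,
                 d_bottleneck: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Lookup table: one vector per speaker ID.
        self.speaker_table = init_matrix(n_speakers, d_model, rng)
        # Down-projection to a narrow code: the assumed "bottleneck".
        self.down = init_matrix(d_bottleneck, d_prosody, rng)
        # Up-projection from the code back to the model dimension.
        self.up = init_matrix(d_model, d_bottleneck, rng)

    def prosody_code(self, prosody_features: Vector) -> Vector:
        # tanh keeps each code value in [-1, 1]; only d_bottleneck values survive.
        return [math.tanh(x) for x in matvec(self.down, prosody_features)]

    def __call__(self, hidden: List[Vector], speaker_id: int,
                 prosody_features: Vector) -> List[Vector]:
        spk = self.speaker_table[speaker_id]
        pro = matvec(self.up, self.prosody_code(prosody_features))
        # The same speaker and prosody vectors are added to every frame.
        return [[h + s + p for h, s, p in zip(frame, spk, pro)] for frame in hidden]


# Usage: 5 encoder frames of width 8, a 16-dim reference prosody vector.
block = ConditionedEncoderSketch(n_speakers=4, d_model=8, d_prosody=16, d_bottleneck=2)
hidden = [[0.0] * 8 for _ in range(5)]
ref_prosody = [0.5] * 16  # stand-in for features from an emotional reference
out = block(hidden, speaker_id=1, prosody_features=ref_prosody)
print(len(out), len(out[0]))  # 5 8
```

Keeping `prosody_features` fixed while changing `speaker_id` illustrates the kind of swap such a design could use for style transfer. Whether identity is actually preserved depends on training, which this sketch omits entirely.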

speech-synthesis · emotional-speech
