Research
An Empirical Study on Learning Latent Representations for Emotional Speech Synthesis
The paper studies emotional speech synthesis (ESS) using a FastSpeech 2 architecture extended with speaker embeddings and a prosody bottleneck. The proposed system generates emotional speech for a single speaker and transfers speaking styles while preserving the speaker identity learned from neutral data. For practitioners, the work addresses the challenge of expressiveness in text-to-speech systems, improving the naturalness and variability of synthesized speech.
speech-synthesis, emotional-speech