Multimodal · arXiv cs.AI · 15 d ago

How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

The paper analyzes how style captions influence generated speech in text-to-speech (TTS) systems by applying cross-attention attribution to speech diffusion models, including CapSpeech-TTS. The authors adapt the DAAM framework to produce per-token heatmaps across 25 layers and 24 ODE steps. The heatmaps indicate that style tokens exhibit lower temporal variance and that their attention peaks in early diffusion steps, a finding relevant to controllability in expressive TTS. For practitioners, the work clarifies how natural-language input interacts with acoustic output and may guide improvements in TTS model design and performance.
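As a rough illustration of the kind of analysis summarized above, the sketch below aggregates cross-attention weights into per-token heatmaps in the spirit of DAAM, then computes each token's temporal variance and the ODE step where its attention mass peaks. This is not the authors' code: the tensor layout `(steps, layers, heads, frames, tokens)`, the function names, the normalization, and the random demo data are all assumptions made for illustration.

```python
import numpy as np

# Assumed layout (hypothetical): attn[step, layer, head, frame, token]
# holds the cross-attention weight from an audio frame to a caption token.

def aggregate_heatmaps(attn: np.ndarray) -> np.ndarray:
    """Sum attention over ODE steps, layers, and heads.

    Returns an array of shape (tokens, frames): one heatmap per token.
    """
    return attn.sum(axis=(0, 1, 2)).T

def temporal_variance(heatmaps: np.ndarray) -> np.ndarray:
    """Variance across frames of each token's normalized heatmap.

    Low values mean the token's attention is spread evenly over time.
    """
    totals = np.maximum(heatmaps.sum(axis=1, keepdims=True), 1e-12)
    return (heatmaps / totals).var(axis=1)

def peak_step(attn: np.ndarray) -> np.ndarray:
    """Index of the ODE step at which each token receives the most attention."""
    per_step = attn.sum(axis=(1, 2, 3))  # shape: (steps, tokens)
    return per_step.argmax(axis=0)

if __name__ == "__main__":
    # Synthetic stand-in for recorded attention: 24 steps, 25 layers,
    # 2 heads, 50 frames, 6 caption tokens (heads/frames/tokens are made up).
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(24, 25, 2, 50, 6))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

    heatmaps = aggregate_heatmaps(attn)
    print("heatmap shape:", heatmaps.shape)
    print("temporal variance per token:", temporal_variance(heatmaps))
    print("peak ODE step per token:", peak_step(attn))
```

With real attention maps, comparing `temporal_variance` between style tokens and content tokens, and inspecting `peak_step` per token, would be one way to check the two patterns the summary reports.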

text-to-speech · style-captioned · cross-attention
