Research · arXiv cs.AI · 4 d ago

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

The paper applies BatchTopK sparse autoencoders (SAEs) to interpret and steer the CosyVoice3 text-to-speech (TTS) model. Its modality-aware auto-interp pipeline identifies interpretable features such as phonemes and speaker gender, and intervening on those features changes the TTS output in targeted ways: in one example, the probability of laughter rises from 0.02 to 0.79. The results suggest that SAE features can serve both interpretability and control in TTS systems, giving practitioners a way to understand and manipulate model behavior.
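For readers unfamiliar with the technique, the following is a minimal NumPy sketch of the two ideas the summary names: a BatchTopK activation, which keeps the k × batch-size largest activations across the whole batch rather than k per sample, and steering by adding a scaled decoder direction to a hidden state. It is not the paper's implementation. The weights are random, the dimensions are arbitrary, and the names `batch_topk`, `steer`, `W_enc`, `W_dec`, `feature`, and `alpha` are assumptions chosen for illustration. Additive steering along a decoder row is one common approach and may differ from the intervention the authors use.

```python
import numpy as np


def batch_topk(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k * batch_size largest ReLU activations over the whole batch."""
    acts = np.maximum(pre_acts, 0.0)      # ReLU before selection
    n_keep = k * acts.shape[0]            # activation budget shared by the batch
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    # Indices of the n_keep largest entries (order within the set is irrelevant).
    idx = np.argpartition(flat, -n_keep)[-n_keep:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(acts.shape)


def steer(hidden: np.ndarray, W_dec: np.ndarray, feature: int, alpha: float) -> np.ndarray:
    """Add alpha times one feature's decoder direction to every hidden state."""
    return hidden + alpha * W_dec[feature]


# Toy setup: random weights stand in for a trained SAE (assumption).
rng = np.random.default_rng(0)
d_model, n_features, batch, k = 16, 64, 8, 4
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder rows

hidden = rng.normal(size=(batch, d_model))    # stand-in for TTS model activations
codes = batch_topk(hidden @ W_enc, k)         # sparse feature activations
recon = codes @ W_dec                         # SAE reconstruction of hidden
steered = steer(hidden, W_dec, feature=3, alpha=5.0)

# The budget is shared across the batch, so one sample may use more than k features.
toy = batch_topk(np.array([[1.0, 5.0, 4.0], [3.0, 0.5, -2.0]]), k=1)
```

In the toy call at the end, both retained activations fall in the first row and the second row is zeroed entirely, which is the behavior that distinguishes BatchTopK from a per-sample TopK.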

text-to-speech · language model · autoencoders · relevance 0.40 · engagement 0.00
