Multimodal · arXiv cs.AI · 14 days ago

FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS

FlowEdit is a lifelong adaptation framework for frozen flow-matching text-to-speech (TTS) systems that lets them learn pronunciation corrections without retraining. It uses a Modern Hopfield Network as a content-addressable episodic memory and learns each correction by optimizing token-level perturbations in the text embedding space from corrective feedback. In benchmarks on 312 multilingual proper nouns, FlowEdit reduced the target-word Phoneme Error Rate (PER) by 92.7% relative to the zero-shot baseline while preserving general-speech quality, and a correction is processed in about 15 seconds on a single GPU.
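The memory mechanism can be illustrated with a minimal sketch. It assumes each correction is stored as a key/value pair. The key is the embedding of the mispronounced token, and the value is the perturbation that the feedback-driven optimization step (not shown) has already found. Retrieval follows the Modern Hopfield update in its key/value, attention-like form: a softmax over β-scaled similarities that weights the stored perturbations. The class name `HopfieldCorrectionMemory`, the `beta` value, the cosine `threshold` gate, and the additive application to embeddings are hypothetical choices for illustration. This is not FlowEdit's published implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Unit-length copy of v, so dot products become cosine similarities.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


class HopfieldCorrectionMemory:
    """Hypothetical sketch: keys are token embeddings, values are learned perturbations."""

    def __init__(self, beta=8.0, threshold=0.9):
        self.beta = beta            # inverse temperature of the Hopfield softmax
        self.threshold = threshold  # hypothetical gate: min cosine similarity to retrieve
        self.keys = []
        self.values = []

    def write(self, key, delta):
        # Store one correction; assumes the perturbation was already optimized elsewhere.
        if len(key) != len(delta):
            raise ValueError("key and delta must have the same dimension")
        self.keys.append(normalize(key))
        self.values.append(list(delta))

    def read(self, query):
        # Return the retrieved perturbation for one token embedding.
        zeros = [0.0] * len(query)
        if not self.keys:
            return zeros
        q = normalize(query)
        sims = [dot(k, q) for k in self.keys]
        if max(sims) < self.threshold:
            return zeros  # no stored correction is close enough: leave the token untouched
        weights = softmax([self.beta * s for s in sims])
        return [
            sum(w * v[i] for w, v in zip(weights, self.values))
            for i in range(len(query))
        ]


def apply_corrections(memory, token_embeddings):
    # Add each token's retrieved perturbation to its embedding before it enters the frozen TTS model.
    return [[e + d for e, d in zip(emb, memory.read(emb))] for emb in token_embeddings]


if __name__ == "__main__":
    memory = HopfieldCorrectionMemory(beta=8.0, threshold=0.9)
    memory.write([1.0, 0.0, 0.0], [0.0, 0.5, 0.0])   # correction for a token near [1, 0, 0]
    memory.write([0.0, 1.0, 0.0], [0.0, 0.0, -0.3])  # correction for a token near [0, 1, 0]
    # The first token matches a stored key; the second has no close match and passes through unchanged.
    print(apply_corrections(memory, [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]))
```

The similarity gate is one simple way to keep unrelated tokens unperturbed, which is in the spirit of preserving general-speech quality. Whether FlowEdit gates retrieval this way is not stated in the summary.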

tts · pronunciation · adaptation
