Multimodal
FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS
FlowEdit is a lifelong adaptation framework for frozen flow-matching text-to-speech (TTS) systems that lets them learn pronunciation corrections without retraining. It uses a Modern Hopfield Network as content-addressable episodic memory and optimizes token-level perturbations in the text embedding space based on corrective feedback. On a benchmark of 312 multilingual proper nouns, FlowEdit reduces target-word Phoneme Error Rate (PER) by 92.7% compared to the zero-shot baseline while maintaining general-speech quality, with corrections processed in about 15 seconds on a single GPU.
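To make the content-addressable lookup concrete, here is a minimal sketch of one-step Modern Hopfield retrieval, which computes a softmax over key similarities and returns the weighted sum of stored values. The key/value layout is an assumption for illustration, not FlowEdit's confirmed design: it treats keys as unit-norm text embeddings of corrected words and values as their stored perturbations. The function name `hopfield_retrieve`, the inverse temperature `beta`, and all dimensions are hypothetical.

```python
import numpy as np

def hopfield_retrieve(query, keys, values, beta=8.0):
    """One-step Modern Hopfield retrieval (illustrative sketch).

    Returns the softmax(beta * keys @ query)-weighted sum of `values`
    together with the attention weights over stored patterns.
    """
    scores = beta * (keys @ query)
    scores = scores - scores.max()   # subtract max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ values, weights

dim = 16
# Hypothetical memory: 3 stored corrections with orthonormal keys.
keys = np.eye(3, dim)
# Hypothetical stored perturbations, one row per correction.
values = np.arange(3 * dim, dtype=float).reshape(3, dim) * 0.01

# A slightly perturbed version of the second key, renormalized to unit length.
query = keys[1] + 0.05
query = query / np.linalg.norm(query)

delta, weights = hopfield_retrieve(query, keys, values)
print(weights.argmax())  # index of the stored correction that dominates
```

A larger `beta` makes retrieval closer to a hard nearest-key lookup, while a smaller `beta` blends several stored perturbations. Which regime a real system should use is a design choice this sketch does not settle.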
tts, pronunciation, adaptation