Research
DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching
DiFlow-TTS is a newly proposed zero-shot text-to-speech framework that uses discrete flow matching to improve both generation quality and inference efficiency. It combines a deterministic Phoneme-Content Mapper for linguistic modeling with a Factorized Discrete Flow Denoiser that generates the prosody and acoustic token streams simultaneously. The design targets two known pain points, the inference latency of autoregressive models and the training constraints of diffusion-based methods, which makes it relevant to practitioners building compact, low-latency TTS systems.
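To make the idea of generating two token streams in parallel concrete, here is a minimal sketch of a generic masked discrete flow matching sampler. It is not the DiFlow-TTS implementation: the mask-based probability path with a linear schedule, the Euler-style update, the uniform `toy_head` standing in for a learned denoiser, and every name and vocabulary size below are assumptions made for illustration.

```python
import random

MASK = -1  # hypothetical id for the "noise" (masked) state


def toy_head(tokens, t, vocab_size):
    """Stand-in for one learned denoiser head (assumption, not the paper's model).

    Returns a uniform distribution over the vocabulary for every position.
    A real head would condition on text, a speaker prompt, and the
    partially denoised tokens.
    """
    return [[1.0 / vocab_size] * vocab_size for _ in tokens]


def generate(length, vocab_sizes, steps=8, seed=0):
    """Denoise several token streams in the same loop, one head per stream."""
    rng = random.Random(seed)
    streams = {name: [MASK] * length for name in vocab_sizes}
    for k in range(steps):
        t = k / steps
        # With a linear schedule, an Euler step of size 1/steps leaves the
        # mask state with probability (1/steps) / (1 - t) = 1 / (steps - k).
        # This is exactly 1.0 on the final step, so no mask survives.
        p_unmask = 1.0 / (steps - k)
        for name, tokens in streams.items():
            probs = toy_head(tokens, t, vocab_sizes[name])
            for i in range(length):
                if tokens[i] == MASK and rng.random() < p_unmask:
                    tokens[i] = rng.choices(
                        range(vocab_sizes[name]), weights=probs[i]
                    )[0]
    return streams


if __name__ == "__main__":
    out = generate(length=6, vocab_sizes={"prosody": 4, "acoustic": 16}, steps=5)
    print(out)
```

The number of denoiser calls is fixed by `steps` rather than by sequence length, which is the general reason non-autoregressive samplers of this kind can reduce latency relative to token-by-token decoding.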
text-to-speech, zero-shot, discrete-flow