Inference
Whisfusion: Parallel ASR Decoding with Masked Diffusion
Whisfusion is a non-autoregressive (NAR) ASR system built around a masked diffusion decoder trained on Whisper-large-v3 audio embeddings. Trained on approximately 68k hours of multilingual speech, it outperforms Whisper-large-v3 and Whisper-turbo in group-average accuracy across benchmarks while decoding 4-5x faster, and it is competitive with other leading models such as Canary and Qwen3-ASR. The results suggest that masked diffusion decoding can reduce ASR latency relative to the Whisper baselines without giving up accuracy.
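To make the parallel-decoding idea concrete, here is a minimal sketch of confidence-based iterative unmasking, one common way masked diffusion decoders fill a sequence in a few passes. It is not taken from Whisfusion: the function names, the unmasking schedule, and the toy predictor (which stands in for a real audio-conditioned decoder) are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Tuple

MASK = "<mask>"

# Hypothetical predictor type: given the partially masked sequence, return a
# (token, confidence) guess for every position in a single parallel pass.
Predictor = Callable[[List[str]], List[Tuple[str, float]]]


def parallel_unmask_decode(predict: Predictor, length: int, steps: int) -> List[str]:
    """Start fully masked; each step commits the most confident masked positions."""
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = predict(tokens)  # one pass scores all positions at once
        # Assumed schedule: spread the remaining masks evenly over remaining steps.
        k = math.ceil(len(masked) / (steps - step))
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


def make_toy_predictor(target: List[str]) -> Predictor:
    """Toy stand-in for a trained decoder: always proposes the target token,
    with confidence that rises when neighbouring positions are already revealed."""
    def predict(tokens: List[str]) -> List[Tuple[str, float]]:
        out = []
        for i, tok in enumerate(target):
            neighbours = [tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
            revealed = sum(t != MASK for t in neighbours)
            out.append((tok, 0.5 + 0.2 * revealed - 0.01 * i))
        return out
    return predict


if __name__ == "__main__":
    target = "the cat sat on the mat".split()
    print(parallel_unmask_decode(make_toy_predictor(target), len(target), steps=3))
```

With six tokens and `steps=3`, the loop calls the predictor three times, whereas an autoregressive decoder needs one pass per token. That reduction in sequential passes is the general source of the speedup in this family of decoders.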
asr, masked diffusion, latency