Inference
CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1 (4.6M params), with UTMOS scoring on every sample
A CPU-only benchmark compared three TTS models, Kokoro (82M params), Supertonic 3, and Inflect-Nano-v1 (4.6M params), on Intel Xeon hardware, scoring every sample with UTMOS for audio quality. Inflect-Nano was the fastest, with a real-time factor (RTF) of 7.3x, but it scored poorly on audio quality (MOS 3.48) because of naturalness issues. Kokoro produced the most human-like output (MOS 4.44) but ran more slowly. For practitioners, the results illustrate the trade-off between speed and audio quality in TTS systems: Kokoro is the most suitable of the three for applications that require natural-sounding speech.
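As a rough illustration of how a speed figure like "7.3x" can be derived, here is a minimal Python sketch. It assumes the figure means seconds of audio generated per second of wall-clock compute (some sources define RTF as the inverse ratio), and the names `realtime_speedup`, `time_synthesis`, and the `synthesize` callable are hypothetical placeholders, not part of any of the benchmarked models' APIs.

```python
import time
from typing import Callable, Sequence, Tuple


def realtime_speedup(audio_seconds: float, synthesis_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock compute.

    Assumption: this matches the "x" style figure in the summary, where a
    larger number means faster synthesis.
    """
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


def time_synthesis(
    synthesize: Callable[[str], Sequence[float]],  # hypothetical model wrapper
    text: str,
    sample_rate: int,
) -> Tuple[float, float]:
    """Run one synthesis call and return (audio_seconds, elapsed_seconds)."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    # Audio duration follows from the sample count and the sample rate.
    audio_seconds = len(samples) / sample_rate
    return audio_seconds, elapsed
```

In practice one would average over many sentences and discard a warm-up run before computing the ratio, since the first call often includes model loading and cache effects.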
tts benchmark cpu