Multimodal
Probing Low Frame Rate Degradation in Neural Audio Codecs
The paper investigates why neural audio codecs degrade at low frame rates, focusing on autoregressive speech synthesis at 12.5 Hz and below. It attributes the previously observed quality cliff at 6.25 Hz to suboptimal training configurations rather than to an inherent limitation: once training is adjusted, word error rate (WER) degrades smoothly down to 1.6 Hz. This suggests that the efficiency gains of low frame rate codecs can be realized with proper training strategies, which matters for practitioners optimizing audio synthesis models.
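To make the efficiency argument concrete, the sketch below counts how many codec frames an autoregressive model would have to generate at the frame rates mentioned above. The helper name `frames_for_duration` and the 10-second utterance are illustrative assumptions, not taken from the paper, and the count ignores any per-frame codebook multiplicity a particular codec may have.

```python
import math


def frames_for_duration(duration_s: float, frame_rate_hz: float) -> int:
    """Number of codec frames needed to cover `duration_s` seconds of audio.

    Hypothetical helper for illustration: rounds up so a partial final
    frame still counts as one generation step.
    """
    return math.ceil(duration_s * frame_rate_hz)


# Frame rates named in the summary; the 10 s duration is an assumed example.
for rate_hz in (12.5, 6.25, 1.6):
    n_frames = frames_for_duration(10.0, rate_hz)
    print(f"{rate_hz:>5} Hz: {n_frames} frames for 10 s of speech")
```

Because the number of autoregressive steps scales linearly with frame rate, halving the rate roughly halves the sequence length the model must generate, which is the source of the efficiency gain the summary refers to.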
audio, neural codecs, speech synthesis