Research
Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations
This study examines the limitations of mean opinion score (MOS) prediction models used to evaluate text-to-speech (TTS) systems by analyzing how they respond to controlled perturbations of speech quality, including acoustic degradation and prosodic errors. It finds that the models track acoustic fidelity effectively but fail to account for prosodic errors, and that they exhibit biases tied to speaker characteristics, such as fundamental frequency (F0), that do not align with human listener perception. These findings underscore the need for metrics that capture aspects of speech quality beyond acoustic fidelity, and they inform future TTS evaluation methodologies.
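As a rough illustration of what a controlled acoustic perturbation can look like, the sketch below adds white Gaussian noise to a waveform at a chosen signal-to-noise ratio (SNR). The summary does not say which degradations the study applied, so the noise type, the SNR levels, the synthetic 200 Hz tone standing in for speech, and the helper names `add_noise_at_snr` and `measured_snr_db` are all assumptions made for this example.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return a copy of `signal` with white Gaussian noise at the target SNR (dB).

    Hypothetical helper for illustration; not taken from the study.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale the noise so that signal_power / scaled_noise_power == 10^(snr_db / 10).
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, degraded):
    """Compute the SNR (dB) of `degraded` relative to the `clean` reference."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((d - c) ** 2 for c, d in zip(clean, degraded)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Synthetic stand-in for a speech waveform: a 200 Hz tone, 1 s at 16 kHz (assumed values).
SAMPLE_RATE = 16000
clean = [0.5 * math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]

# Sweep a few assumed degradation levels, from mild to severe.
for snr in (30, 20, 10, 0):
    degraded = add_noise_at_snr(clean, snr)
    print(f"target SNR {snr:>2} dB, measured {measured_snr_db(clean, degraded):.2f} dB")
```

Sweeping the SNR in fixed steps gives a graded series of stimuli, so scores from a MOS predictor and from human listeners could be compared level by level. A prosodic perturbation, such as a misplaced stress or an altered F0 contour, would need a different manipulation and is not covered by this sketch.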
speech-assessment, llm, quality, acoustic