Research
Scaling Human and G2P Supervision for Robust Phonetic Transcription
This study evaluates how well Grapheme-to-Phoneme (G2P) models serve as a supervision source for automatic phonetic transcription when combined with human annotation. Using an 80-hour benchmark dataset, the authors identify a critical threshold: G2P supervision helps only when human annotations are limited (20-30 hours), and beyond that point it may hinder performance. They also report that ASR pretraining significantly improves transcription accuracy, yielding a 2.3x reduction in weighted phone feature error rate, especially for non-native and aphasic speech. Together, the results indicate that relying on G2P alone may lead to diminishing returns for robust phonetic transcription.
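To make the reported metric more concrete, here is a minimal sketch of a feature-weighted phone error rate. It is not the authors' definition: the summary does not specify how the weighting is computed. The sketch assumes that a substitution costs the fraction of articulatory features on which two phones differ, that insertions and deletions cost 1.0, and that the total is normalized by reference length. The `FEATURES` table and the function names are hypothetical and exist only for illustration.

```python
from typing import Dict, FrozenSet, List

# Hypothetical toy feature table (an assumption, not taken from the study).
FEATURES: Dict[str, FrozenSet[str]] = {
    "p": frozenset({"bilabial", "stop", "voiceless"}),
    "b": frozenset({"bilabial", "stop", "voiced"}),
    "t": frozenset({"alveolar", "stop", "voiceless"}),
    "s": frozenset({"alveolar", "fricative", "voiceless"}),
    "a": frozenset({"vowel", "open", "voiced"}),
}


def sub_cost(x: str, y: str) -> float:
    """Substitution cost: share of features on which the two phones differ."""
    if x == y:
        return 0.0
    fx, fy = FEATURES[x], FEATURES[y]
    return len(fx ^ fy) / len(fx | fy)


def feature_weighted_error_rate(
    ref: List[str], hyp: List[str], indel_cost: float = 1.0
) -> float:
    """Feature-weighted edit distance, normalized by reference length."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum cost of aligning ref[:i] with hyp[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel_cost
    for j in range(1, m + 1):
        dp[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + indel_cost,  # deletion
                dp[i][j - 1] + indel_cost,  # insertion
                dp[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]),  # substitution
            )
    return dp[n][m] / max(n, 1)


if __name__ == "__main__":
    # A voicing error (p -> b) is penalized less than a full phone error.
    print(feature_weighted_error_rate(["p", "a", "t"], ["b", "a", "t"]))
```

Under these assumptions, confusing two phones that share most features costs less than dropping a phone entirely, which is the general motivation for feature-weighted metrics over a plain phone error rate.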
phonetic-transcription, G2P, ASR