Models · arXiv cs.CL · 8 d ago

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

UR-BERT is a text encoder for text-to-speech (TTS) that operates on romanized transcriptions. By mapping diverse writing systems into a single shared romanized representation, it scales to 495 languages and avoids the limitations of traditional grapheme-to-phoneme (G2P) approaches. A speech token prediction objective improves phonetic fidelity and text-speech alignment, and the model is reported to outperform existing text encoder baselines across diverse languages and conditions. For practitioners, this makes multilingual TTS systems more accessible and effective.
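To make the idea of a shared romanized representation concrete, here is a minimal sketch. It is an assumption-laden toy and not the romanizer used by UR-BERT: the mapping table `TOY_ROMANIZATION` and the helpers `toy_romanize` and `shared_vocabulary` are hypothetical names introduced only for illustration, and the table covers just a handful of Cyrillic and Greek letters.

```python
# Toy illustration only: a tiny hand-written table, NOT the paper's romanizer.
# All names here are hypothetical and exist only for this sketch.
TOY_ROMANIZATION = {
    "м": "m", "и": "i", "р": "r",  # a few Cyrillic letters
    "μ": "m", "ι": "i", "ρ": "r",  # a few Greek letters
}


def toy_romanize(text: str) -> str:
    """Lowercase the text and map each character through the toy table.

    Characters without an entry (including Latin letters) pass through unchanged.
    """
    return "".join(TOY_ROMANIZATION.get(ch, ch) for ch in text.lower())


def shared_vocabulary(words: list[str]) -> list[str]:
    """Return the sorted set of symbols left after romanizing every word."""
    return sorted({ch for word in words for ch in toy_romanize(word)})


if __name__ == "__main__":
    # "мир" (Cyrillic) and "mir" (Latin) collapse to the same symbol sequence,
    # so an encoder over romanized text sees one input form for both scripts.
    for word in ["мир", "mir", "μι"]:
        print(word, "->", toy_romanize(word))
    print(shared_vocabulary(["мир", "mir", "μι"]))
```

In this sketch, inputs written in different scripts end up drawing on one small Latin symbol inventory, which is the property that lets a single encoder vocabulary serve many languages. A real romanizer has to handle context-dependent mappings, digraphs, and scripts without a one-to-one letter correspondence, none of which this toy attempts.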

Tags: tts · multilingual · text-encoder
