Coding · arXiv cs.AI · 8 d ago

Pixel-TTS: Image based Text Rendering for Robust Text-to-Speech

Pixel-TTS is a text-to-speech framework that represents input text as rendered images, so the model can draw on visual cues in the written form for language understanding. A 2D convolutional layer turns the rendered text into embeddings, which improves robustness to unseen characters and orthographic variation and removes the need to expand an embedding matrix during fine-tuning. The approach is reported to perform competitively with traditional methods while converging faster and generalizing zero-shot, which makes it relevant to speech synthesis and cross-lingual applications.
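
The idea of embedding rendered text with a 2D convolution can be illustrated with a toy sketch. This is not the paper's implementation: the 5x3 bitmap font, the one-glyph patch size, the random untrained kernels, and the names `GLYPHS`, `render`, `make_kernels`, and `conv_embed` are all assumptions made for illustration. A real system would rasterize text with an actual font and learn the kernels during training.

```python
import random

# Hypothetical 5x3 bitmap "font"; stands in for a real text rasterizer.
GLYPHS = {
    "a": ["010", "101", "111", "101", "101"],
    "e": ["111", "100", "110", "100", "111"],
    "é": ["001", "111", "110", "100", "111"],  # a character the "model" never saw
    "t": ["111", "010", "010", "010", "010"],
}
GLYPH_H, GLYPH_W = 5, 3


def render(text):
    """Rasterize text into a GLYPH_H x (GLYPH_W * len(text)) image of 0.0/1.0 pixels."""
    rows = [[] for _ in range(GLYPH_H)]
    for ch in text:
        bitmap = GLYPHS[ch]
        for r in range(GLYPH_H):
            rows[r].extend(float(px) for px in bitmap[r])
    return rows


def make_kernels(embed_dim, seed=0):
    """Create embed_dim random kernels of shape GLYPH_H x GLYPH_W (untrained, assumed)."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(GLYPH_W)] for _ in range(GLYPH_H)]
        for _ in range(embed_dim)
    ]


def conv_embed(image, kernels):
    """Apply a 2D convolution with stride GLYPH_W along the width.

    Each kernel covers exactly one glyph-sized patch, so the output is one
    embed_dim-dimensional vector per patch of the rendered image.
    """
    n_patches = len(image[0]) // GLYPH_W
    embeddings = []
    for p in range(n_patches):
        x0 = p * GLYPH_W
        vec = [
            sum(
                k[r][c] * image[r][x0 + c]
                for r in range(GLYPH_H)
                for c in range(GLYPH_W)
            )
            for k in kernels
        ]
        embeddings.append(vec)
    return embeddings


kernels = make_kernels(embed_dim=4)
seen = conv_embed(render("tea"), kernels)   # characters assumed present in training
unseen = conv_embed(render("té"), kernels)  # "é" needs no new embedding row
```

In this sketch the only parameters are the kernels, whose size depends on the patch shape and embedding dimension rather than on the character set. A character absent from training still produces an embedding because it is just another pixel pattern, which is the property the summary attributes to avoiding embedding matrix expansion.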

text-to-speech · visual cues · language understanding
