Multimodal
From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation
This study evaluates four families of speech representations for 3D facial animation, focusing on segmental and semantic cues, acoustic reconstruction, and label-based spaces. The authors show that representations encoding phonetic classes improve facial animation accuracy, and they build on this finding with an Audio Visual Text-to-Speech (AVTTS) pipeline that uses discrete representations to synchronize speech with 3D facial motion. For practitioners, the results offer guidance on choosing a speech representation to improve facial animation quality.
3d-animation speech-representation