Research
ArtNet: A JEPA-Like Articulatory Predictive Framework for Robust Zero-Shot Phoneme Recognition
ArtNet is a framework for zero-shot cross-lingual phoneme recognition that improves acoustic robustness through a structured feature prediction task based on articulatory features. Its architecture includes an articulatory predictor that combines self-supervised learning (SSL) features with a variational information bottleneck (VIB) to suppress language-specific variation. Across seven unseen languages, ArtNet achieves a 20.56% relative reduction in phoneme error rate (PER) and 7.01% in phoneme feature error rate (PFER).
phoneme recognition, zero-shot, articulatory features, self-supervised learning
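The summary above does not say how the variational information bottleneck is parameterized. As a rough illustration of the general technique only, the sketch below assumes a diagonal-Gaussian bottleneck with a standard normal prior, which is a common choice but is not confirmed by the source. All names (`kl_to_standard_normal`, `sample_latent`, `vib_loss`) and the `beta` weight are hypothetical and do not come from ArtNet.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def vib_loss(task_loss, mu, log_var, beta=1e-3):
    """Task loss plus a beta-weighted compression penalty.

    `beta` is an assumed hyperparameter; the source gives no value.
    """
    return task_loss + beta * kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu = [0.5, -0.2, 0.0]        # hypothetical encoder means
    log_var = [0.0, -1.0, 0.3]   # hypothetical encoder log-variances
    z = sample_latent(mu, log_var, rng)
    print("latent sample:", z)
    print("total loss:", vib_loss(task_loss=1.25, mu=mu, log_var=log_var))
```

In this formulation, the KL term penalizes latent codes that carry more information than the prediction task needs. This is the usual argument for why such a bottleneck can discard nuisance variation, here language-specific detail.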