Multimodal
Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming
A new method for multilingual word-level forced alignment pairs a dual-representation alignment encoder with a learned dynamic programming decoder. The encoder combines outputs from the Massively Multilingual Speech (MMS) model with those of a self-supervised phoneme boundary detector (UnSupSeg), and the decoder uses them to estimate word boundaries. The method outperforms existing aligners such as the Montreal Forced Aligner (MFA) on the TIMIT and Buckeye datasets and gives competitive results on unseen languages, including Dutch, German, and Hebrew. The approach could scale to the more than 1,100 languages that MMS supports, which matters for practitioners building multilingual speech processing systems.
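For readers unfamiliar with dynamic programming alignment, the sketch below shows the classic fixed (not learned) version of the idea: given a score for how well each audio frame matches each word, find the monotonic frame-to-word assignment with the highest total score and read word boundaries off the path. It is an illustration only, not the decoder described above. The function name `align_words`, the score matrix, and its values are all assumptions made for this example; in a real system the scores would come from an acoustic encoder.

```python
from typing import List, Tuple

NEG_INF = float("-inf")


def align_words(scores: List[List[float]]) -> List[Tuple[int, int]]:
    """Classic monotonic DP alignment (hypothetical helper, not the paper's decoder).

    scores[t][w] says how well frame t matches word w (higher is better).
    Every word gets at least one frame and words appear in order.
    Returns one (start_frame, end_frame_exclusive) pair per word.
    """
    if not scores or not scores[0]:
        raise ValueError("scores must be a non-empty frames x words matrix")
    n_frames, n_words = len(scores), len(scores[0])
    if n_frames < n_words:
        raise ValueError("need at least one frame per word")

    # best[t][w]: best total score of a path that ends at frame t inside word w.
    best = [[NEG_INF] * n_words for _ in range(n_frames)]
    # advanced[t][w]: True if that best path entered word w exactly at frame t.
    advanced = [[False] * n_words for _ in range(n_frames)]
    best[0][0] = scores[0][0]

    for t in range(1, n_frames):
        # Word w cannot be reached before frame w, hence the upper bound.
        for w in range(min(t + 1, n_words)):
            stay = best[t - 1][w]
            move = best[t - 1][w - 1] if w > 0 else NEG_INF
            if move > stay:
                best[t][w] = move + scores[t][w]
                advanced[t][w] = True
            else:
                best[t][w] = stay + scores[t][w]

    # Backtrace from the last frame of the last word to recover boundaries.
    boundaries = []
    w, end = n_words - 1, n_frames
    for t in range(n_frames - 1, -1, -1):
        if advanced[t][w]:
            boundaries.append((t, end))
            end = t
            w -= 1
    boundaries.append((0, end))  # the first word always starts at frame 0
    return boundaries[::-1]


# Toy scores (made up): frames 0-2 favor word 0, frames 3-5 favor word 1.
toy_scores = [
    [0.0, -5.0],
    [0.0, -5.0],
    [0.0, -5.0],
    [-5.0, 0.0],
    [-5.0, 0.0],
    [-5.0, 0.0],
]
print(align_words(toy_scores))  # [(0, 3), (3, 6)]
```

Frame indices convert to timestamps by multiplying by the encoder's frame shift, which depends on the model used. A learned decoder, as in the method summarized here, would replace the fixed stay-or-advance rule with trained components.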
self-supervised, alignment, multilingual