Training
OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report
OpenLID-v3 is an enhanced version of the OpenLID classifier that aims to improve language identification for closely related languages. Key technical updates include additional training data, the merging of problematic language-variant clusters, and a new noise label. The model was evaluated against GlotLID on multiple benchmarks. The results show that ensemble methods can boost precision but may reduce coverage for low-resource languages, a trade-off that matters to practitioners working with multilingual datasets.
language-identification openlid nlp
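The precision/coverage trade-off mentioned in the summary can be illustrated with a minimal sketch. It assumes an agreement-based ensemble of two classifiers that abstains whenever they disagree; the summary does not say which ensembling scheme was used, so this scheme, the function names, and the toy labels below are all hypothetical and are not results from the report.

```python
from typing import List, Optional, Tuple


def agreement_ensemble(pred_a: str, pred_b: str) -> Optional[str]:
    """Return the shared label if both classifiers agree, otherwise abstain (None)."""
    return pred_a if pred_a == pred_b else None


def precision_and_coverage(
    predictions: List[Optional[str]], gold: List[str]
) -> Tuple[float, float]:
    """Precision over answered items, and the fraction of items answered."""
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(gold) if gold else 0.0
    precision = (
        sum(p == g for p, g in answered) / len(answered) if answered else 0.0
    )
    return precision, coverage


# Toy data (made up for illustration): four sentences in closely related languages.
gold = ["bos_Latn", "hrv_Latn", "srp_Latn", "hrv_Latn"]
model_a = ["bos_Latn", "hrv_Latn", "hrv_Latn", "hrv_Latn"]  # hypothetical classifier A
model_b = ["bos_Latn", "hrv_Latn", "srp_Latn", "bos_Latn"]  # hypothetical classifier B

ensemble = [agreement_ensemble(a, b) for a, b in zip(model_a, model_b)]

print("A:       ", precision_and_coverage(model_a, gold))
print("B:       ", precision_and_coverage(model_b, gold))
print("ensemble:", precision_and_coverage(ensemble, gold))
```

On this toy data each classifier alone answers every item but gets one wrong, while the ensemble answers only the two items on which they agree and gets both right. Abstaining on disagreement is what raises precision, and it is also why coverage drops, which is likely to hurt most where the classifiers disagree often.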