RAG
Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech
OmniSONAR is a new family of omnilingual sentence embedding models that map text, speech, code, and mathematical expressions into a single semantic space, with state-of-the-art performance reported across thousands of languages. Trained progressively with a two-stage teacher-student distillation framework, it halves the cross-lingual similarity search error on the 200-language FLORES dataset and significantly outperforms NLLB-3B on translation tasks. For practitioners, it supports high-quality cross-lingual and cross-modal applications and reduces the need for extensive language-specific resources.
sentence-embeddings cross-lingual omnilingual
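
To make the "cross-lingual similarity search error" in the summary concrete, here is a minimal sketch of how such an error rate is commonly computed: each source sentence retrieves its nearest target sentence by cosine similarity, and a retrieval counts as an error when it is not the aligned translation. The function names and the toy vectors are hypothetical and are not outputs of OmniSONAR; the paper's exact scoring protocol is not given in this summary and may differ (for example, by using a margin-based score instead of plain cosine).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_search_error(src_embs, tgt_embs):
    """Fraction of source sentences whose nearest target is not the aligned one.

    Assumes src_embs[i] and tgt_embs[i] embed translations of the same sentence.
    """
    errors = 0
    for i, src in enumerate(src_embs):
        # Index of the target embedding most similar to this source embedding.
        best = max(range(len(tgt_embs)), key=lambda j: cosine(src, tgt_embs[j]))
        if best != i:
            errors += 1
    return errors / len(src_embs)


# Toy, hand-made embeddings (assumption: stand-ins for real model outputs).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.1]]
tgt = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.5, 0.5, 0.5]]

error_rate = similarity_search_error(src, tgt)
print(error_rate)  # 0.25: the fourth source sentence retrieves the first target
```

With real embeddings, the two lists would hold sentence vectors for the same sentences in two languages (or in text and speech), and a lower error rate indicates better alignment of the shared space.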