Meta AI presents Omnilingual Machine Translation (OMT), the first MT system supporting over 1,600 languages, including endangered and marginalized ones. The system addresses the persistent generation bottleneck where models can understand many languages but fail to generate them reliably. OMT uses a comprehensive data strategy combining public multilingual corpora, manually curated MeDLEY bitext, synthetic backtranslation, and mining. Two model variants are introduced: OMT-LLaMA (decoder-only, built on LLaMA3 with retrieval-augmented translation) and OMT-NLLB (encoder-decoder architecture using OmniSONAR embeddings). Notably, 1B–8B parameter OMT models match or exceed a 70B LLM baseline on MT tasks, demonstrating a clear specialization advantage. New evaluation artifacts include BLASER 3, OmniTOX, and the BOUQuET multilingual evaluation dataset. The BOUQuET and Met-BOUQuET datasets are freely available.
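The retrieval-augmented translation mentioned for OMT-LLaMA can be illustrated with a minimal sketch: retrieve the most similar bitext examples for a source sentence, then assemble them into a few-shot prompt for a decoder-only model. The bitext pairs, the bag-of-words "embedding", and the prompt wording below are all illustrative stand-ins, not the system's actual components (the real system would use learned sentence embeddings such as OmniSONAR over mined/curated bitext).

```python
from collections import Counter
import math

# Toy bitext memory: (source, target) pairs. A stand-in for the
# mined and curated bitext described in the summary.
BITEXT = [
    ("Good morning", "Bonjour"),
    ("Good night", "Bonne nuit"),
    ("Thank you very much", "Merci beaucoup"),
    ("See you tomorrow", "A demain"),
]

def embed(text):
    """Bag-of-words vector -- a crude stand-in for a sentence embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=2):
    """Return the k bitext pairs whose source side is most similar to the query."""
    q = embed(query)
    scored = sorted(BITEXT, key=lambda p: cosine(q, embed(p[0])), reverse=True)
    return scored[:k]

def build_prompt(query, tgt_lang="French"):
    """Few-shot prompt: retrieved examples, then the new source sentence."""
    lines = [f"Translate English to {tgt_lang}."]
    for src, tgt in retrieve(query):
        lines.append(f"English: {src}\n{tgt_lang}: {tgt}")
    lines.append(f"English: {query}\n{tgt_lang}:")
    return "\n\n".join(lines)

prompt = build_prompt("Good evening")
print(prompt)
```

The prompt would then be fed to the decoder-only LLM, which completes the final line with its translation; grounding generation in retrieved examples is one common way to steady output in low-resource languages, which is exactly the generation bottleneck the summary describes.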

9m read time. From ai.meta.com