AI presents both an opportunity and an existential risk for minority languages. While tools like Abair (Irish), Aina (Catalan), and Project Vaani (Indian languages) show promise, 97% of the world's languages are now categorized as 'in danger.' Key problems include biased training data (93% of GPT-3's data was English), tokenizer inefficiency that multiplies compute costs for non-English words, energy infrastructure gaps in developing nations, and performance disparities (Claude Sonnet scores 94% in English reasoning vs. 76% in Yoruba). Proposed solutions include collecting high-quality native-speaker data, government and nonprofit support, small language models (SLMs), improved tokenizers, and accountability for cross-language performance metrics. A creative example from the Iñupiat people, a video game that preserves their language, illustrates unconventional approaches to language preservation in the AI age.
Table of contents
Who gets in the ark?
What are the problems?
Improving models but more to do
How to fix it
A closing thought