Norway's National Library is building a sovereign Norwegian-language LLM using 2 PB of Huawei OceanStor Dorado all-flash storage as part of its AI training data pipeline. The project is driven by the absence of any commercial LLM trained on Norwegian language and culture. The library holds 20 PB of unique digitized content — books, newspapers, web pages, and broadcasts — and has exclusive rights to train on copyrighted Norwegian newspaper content. The main bottleneck is not compute but data quality, cleaning, and pipeline throughput. Data flows from a 60 PB preservation archive through an on-premises pipeline (Nvidia DGX H200, CPU cluster, Huawei flash arrays) before reaching Norway's Sigma2 Olivia national supercomputer for actual training. Key challenges include bridging high-latency archival storage with low-latency AI pipeline storage, building custom LLM evaluation tools for a language with two written forms and multiple dialects, and resolving governance questions around access and usage of a sovereign AI.
Sort: