Small language models (under 10B parameters) are handling production workloads once thought to require much larger models, driven by advances in training methodology, data quality, and open source tooling. Inference costs for GPT-3.5-level performance dropped 280x between 2022 and 2024. The open source ecosystem now controls the full SLM stack: vLLM and llama.cpp for serving, Ollama for local inference, LoRA and QLoRA for fine-tuning, and GPTQ and AWQ for quantization. In agentic systems, small fine-tuned models outperform large frontier models on focused tasks such as data extraction, classification, and routing. The article argues that the real opportunity for open source builders lies in orchestration, fine-tuning workflows, evaluation harnesses, and deployment tooling rather than in building frontier models.
Why AI in production is smaller, cheaper, and finally yours.

Table of contents

- Why small language models are winning now
- Why open source controls the small language model stack
- Where small language models outperform large ones in production
- What the small language model shift means for open source builders
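To make the local-inference piece of that stack concrete, here is a minimal sketch of routing a focused classification task to a small model served by Ollama, the kind of narrow job the summary says small fine-tuned models handle well. It assumes an Ollama server running on its default port with a small model already pulled; the model name (llama3.2) and the ticket text are illustrative placeholders. The request goes through Ollama's OpenAI-compatible endpoint using the openai Python client.

```python
# Minimal sketch: send a focused classification task to a small local model
# through Ollama's OpenAI-compatible endpoint.
# Assumes `ollama serve` is running on the default port (11434) and a small
# model has been pulled (e.g. `ollama pull llama3.2`); the model name and
# ticket text below are placeholders, not the article's own example.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",  # required by the client, ignored by Ollama
)

ticket = "My invoice from March was charged twice, please refund one."

resp = client.chat.completions.create(
    model="llama3.2",  # example small model; any pulled model works
    messages=[
        {
            "role": "system",
            "content": "Classify the support ticket as one of: "
                       "billing, technical, account. Reply with the label only.",
        },
        {"role": "user", "content": ticket},
    ],
    temperature=0,  # keep output stable for a classification task
)
print(resp.choices[0].message.content)  # e.g. "billing"
```

Because vLLM serves the same OpenAI-compatible API, pointing base_url at a vLLM server (typically http://localhost:8000/v1) leaves the rest of the code unchanged, which is part of why the open serving stack is straightforward to standardize on.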