I’ve yet to meet a developer that enjoys working with metered AI APIs. The need to pay for every API call in development works in direct opposition to the ethos of rapid iteration, and it’s easy for the costs to get out of hand. That’s why Canonical has created a different approach to building AI-powered  […]

The Ubuntu Blog provides updates, tutorials, and insights on the Ubuntu operating system and related projects. Covering topics such as Linux desktops, server administration, and cloud computing, the blog offers resources for developers and sysadmins working with Ubuntu. Developers can learn how to set up, configure, and optimize Ubuntu systems for development, deployment, and production environments by following the Ubuntu Blog.

Ubuntu

Canonical's Embedded AI approach packages local LLM inference directly into applications as a snap dependency, eliminating metered API costs, network latency, and data privacy concerns. Inference snaps bundle model weights, the inference runtime (llama.cpp, vLLM), hardware-specific optimizations, and an OpenAI-compatible HTTP API. A single `sudo snap install gemma3` command sets up a locally-running, hardware-optimized inference server. Two reference implementations demonstrate the pattern: a minimal chat app and a PDF summarizer that processes documents entirely on-device. The approach is best suited for privacy-sensitive workloads, latency-critical applications, heavily-used internal tools, and air-gapped environments.

Developing web apps with local LLM inference

Inference snaps: installing AI like a package

When does local LLM inference make sense?