Canonical's Embedded AI approach packages local LLM inference directly into applications as a snap dependency, eliminating metered API costs, network latency, and data privacy concerns. Inference snaps bundle model weights, the inference runtime (llama.cpp, vLLM), hardware-specific optimizations, and an OpenAI-compatible HTTP API. A single `sudo snap install gemma3` command sets up a locally-running, hardware-optimized inference server. Two reference implementations demonstrate the pattern: a minimal chat app and a PDF summarizer that processes documents entirely on-device. The approach is best suited for privacy-sensitive workloads, latency-critical applications, heavily-used internal tools, and air-gapped environments.

9m read timeFrom ubuntu.com
Post cover image
Table of contents
The problem with remote AI servicesFrom AI services to local LLM inferenceInference snaps: installing AI like a packageThe reference implementationsWhen does local LLM inference make sense?Getting started

Sort: