A conference talk by S. Maestri and A. Soldano covering what it takes to run coding agents (specifically Claude Code) against models hosted on your own hardware. Topics include: why you might want local inference (privacy, compliance, cost control, API stability), the toolchain (llama.cpp + LM Studio), model selection using benchmarks from artificialanalysis.io, GGUF quantization formats and their accuracy tradeoffs, memory requirements including KV cache scaling, the prefill and decode performance phases, key llama.cpp configuration parameters, and the Claude Code environment tweaks needed for local use. A real-world experiment compared Claude Opus 4 (15 min, single pass) against a local Qwen 3.5 122B Q4 model (48 min planning + 1 hr implementation) on building a Java dashboard app. The local model produced comparable results but was slower and needed more iteration.
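As a rough illustration of the memory topic, the sketch below estimates quantized weight size and KV cache size with the standard back-of-the-envelope formulas. It is not taken from the talk: the function names are made up for this example, the layer and head counts are hypothetical placeholders rather than the real configuration of any model, and a flat 4 bits per weight is an assumption (real GGUF quantizations carry some per-block overhead, so actual files are somewhat larger).

```python
GIB = 2**30  # bytes per GiB


def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_element: int = 2) -> float:
    """Approximate KV cache size in GiB for a standard attention stack.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per KV head; 2 bytes per element corresponds to fp16.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_tokens * bytes_per_element) / GIB


if __name__ == "__main__":
    # 122B parameters at an assumed flat 4 bits per weight.
    print(f"weights: {weights_gib(122, 4):.1f} GiB")

    # Hypothetical architecture numbers, chosen only to show the scaling.
    for ctx in (32_768, 65_536, 131_072):
        size = kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128,
                            context_tokens=ctx)
        print(f"context {ctx:>7}: KV cache {size:.1f} GiB")
```

The point of the loop is that the KV cache grows linearly with context length, so doubling the context window doubles that part of the memory budget on top of the fixed cost of the weights.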

45m watch time
