Token frugality (Tokensparsamkeit) is proposed as a key skill for developers using AI coding assistants. Two main strategies are covered: token compression via a proxy tool (rtk) that strips unnecessary words before requests reach the LLM, and context optimization by limiting message history, tools, and MCP servers. The post also walks through a practical setup for running Claude Code against a local model using llama.cpp with a Mixture-of-Experts model (Qwen3.5-35B-A3B), covering pitfalls such as Docker GPU access, context window limits in dense models, and parallelism settings. The final configuration uses Flash Attention, GPU layer offloading, and specific environment variables to point Claude Code at the local server.
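For illustration, a minimal sketch of what such a setup can look like, assuming llama.cpp's `llama-server` binary and Claude Code's documented `ANTHROPIC_*` environment variables; the model file, port, and numeric values are placeholders rather than the post's exact configuration, and flag spellings can differ between llama.cpp versions:

```bash
# Start the local OpenAI-compatible server with Flash Attention (-fa),
# GPU layer offload (-ngl), a 32k context (-c) and two parallel slots (-np).
# Note: the context window is shared between parallel slots.
llama-server \
  -m ./models/qwen3-moe-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  -fa \
  -np 2 \
  --host 0.0.0.0 \
  --port 8080

# Point Claude Code at the local endpoint instead of api.anthropic.com.
# A translation layer is usually needed in between if the server only
# speaks the OpenAI chat API rather than Anthropic's Messages API.
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_AUTH_TOKEN="local-dummy-key"   # local server ignores it
export ANTHROPIC_MODEL="qwen3-moe"
claude
```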