A practical guide to self-hosting LLMs for agent workloads on a single GPU machine. Covers which benchmarks matter for agentic tasks (BFCL, τ-bench, SWE-bench, IFEval), quantization formats (BF16, GPTQ, AWQ, GGUF/K-quants) and their performance tradeoffs, GPU selection across AWS/GCP/Azure with pricing, KV cache sizing, and recommended models (Qwen3.5-27B, GLM-4.7 Flash, GPT-OSS-20B). Deployment patterns include Ollama for evaluation and vLLM for production with PagedAttention. Also covers zero-switch-cost migration from OpenAI and Anthropic APIs using vLLM's OpenAI-compatible endpoint and LiteLLM proxy, plus cost analysis showing self-hosting breaks even at roughly 40–100M tokens/month.
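The break-even figure above comes down to simple arithmetic: self-hosting is a fixed monthly GPU cost, while API usage scales per token. A minimal sketch of that comparison, where the GPU price and blended API rate are illustrative assumptions (not figures from the article):

```python
# Hedged sketch: rough break-even point for self-hosting vs. pay-per-token APIs.
# Both constants below are assumptions for illustration, not quoted prices.

GPU_MONTHLY_USD = 600.0        # assumed: one reserved single-GPU instance per month
API_PRICE_PER_MTOK_USD = 6.0   # assumed blended input/output API price per 1M tokens

def breakeven_tokens_per_month(gpu_monthly: float, api_per_mtok: float) -> float:
    """Tokens/month at which the fixed GPU cost equals per-token API spend."""
    return gpu_monthly / api_per_mtok * 1_000_000

tokens = breakeven_tokens_per_month(GPU_MONTHLY_USD, API_PRICE_PER_MTOK_USD)
print(f"break-even ~= {tokens / 1e6:.0f}M tokens/month")  # prints "break-even ~= 100M tokens/month"
```

With different assumed GPU and API rates the crossover shifts, which is why the article quotes a range (roughly 40–100M tokens/month) rather than a single number.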

21 min read · From towardsdatascience.com
Table of contents
- Wait… why would I host my own LLM again?
- Why a single machine?
- Which Benchmarks Actually Matter?
- Quantizing
- Hardware
- Models
- Deployment
- Zero switch costs?
- How much is this going to cost?
- Wrapping things up
