A practical guide to self-hosting LLMs for agent workloads on a single GPU machine. Covers which benchmarks matter for agentic tasks (BFCL, τ-bench, SWE-bench, IFEval), quantization formats (BF16, GPTQ, AWQ, GGUF/K-quants) and their performance tradeoffs, GPU selection across AWS/GCP/Azure with pricing, KV cache sizing, and recommended models (Qwen3.5-27B, GLM-4.7 Flash, GPT-OSS-20B). Deployment patterns include Ollama for evaluation and vLLM for production with PagedAttention. Also covers zero-switch-cost migration from OpenAI and Anthropic APIs using vLLM's OpenAI-compatible endpoint and LiteLLM proxy, plus cost analysis showing self-hosting breaks even at roughly 40–100M tokens/month.
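The break-even figure above comes down to simple arithmetic: self-hosting is a fixed monthly GPU cost, while API usage scales per token. A minimal sketch of that comparison, where the GPU price and blended API rate are illustrative assumptions (not figures from the article):

```python
# Hedged sketch: rough break-even point for self-hosting vs. pay-per-token APIs.
# Both constants below are assumptions for illustration, not quoted prices.

GPU_MONTHLY_USD = 600.0        # assumed: one reserved single-GPU instance per month
API_PRICE_PER_MTOK_USD = 6.0   # assumed blended input/output API price per 1M tokens

def breakeven_tokens_per_month(gpu_monthly: float, api_per_mtok: float) -> float:
    """Tokens/month at which the fixed GPU cost equals per-token API spend."""
    return gpu_monthly / api_per_mtok * 1_000_000

tokens = breakeven_tokens_per_month(GPU_MONTHLY_USD, API_PRICE_PER_MTOK_USD)
print(f"break-even ~= {tokens / 1e6:.0f}M tokens/month")  # prints "break-even ~= 100M tokens/month"
```

With different assumed GPU and API rates the crossover shifts, which is why the article quotes a range (roughly 40–100M tokens/month) rather than a single number.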

21 min read · From towardsdatascience.com
Table of contents
- Wait… why would I host my own LLM again?
- Why a single machine?
- Which Benchmarks Actually Matter?
- Quantizing
- Hardware
- Models
- Deployment
- Zero switch costs?
- How much is this going to cost?
- Wrapping things up
