A practical comparison of three LLM weight formats for local inference: FP16 (the unquantized half-precision baseline), Q4_K_M GGUF (CPU- and hybrid-friendly via llama.cpp), and AWQ (activation-aware weight quantization for GPUs). It covers the math behind memory requirements, how k-quants and AWQ differ in their precision-allocation strategies, and benchmark data on file size, VRAM usage, tokens/sec, and perplexity degradation for a Llama 3 8B model. It also includes a decision flowchart for choosing the right format based on hardware, code examples for loading models with Ollama and Hugging Face Transformers, step-by-step conversion scripts using llama.cpp and AutoAWQ, and common pitfalls such as format/runtime mismatches and the irreversibility of quantization.
Table of contents
Q4_K_M vs AWQ vs FP16 Comparison
What Is Model Quantization and Why Does It Matter for Local LLMs?
FP16: The Full-Fidelity Baseline
Q4_K_M: The GGUF Sweet Spot for CPU and Hybrid Inference
AWQ: Activation-Aware Weight Quantization for GPU Inference
Head-to-Head Comparison: Benchmark Table and Analysis
Decision Flowchart: Which Format Should You Choose?
Converting Between Formats: Practical Code Examples
Common Pitfalls and Tips
Summary and Recommendations
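As a rough illustration of the memory math mentioned in the summary, weight storage can be approximated as parameter count times bits per weight, divided by 8. The sketch below is not taken from the article's benchmarks: the function name `weight_memory_gb`, the round 8.0e9 parameter count, and the effective bits-per-weight figures for Q4_K_M and AWQ are all assumptions chosen for illustration. Only the 16 bits for FP16 is exact.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: a round figure for an 8B-parameter model.
N_PARAMS = 8.0e9

# Assumption: illustrative effective bits per weight. The 4-bit formats
# store per-group scales alongside the weights, so they land above 4.0.
ASSUMED_BITS = {
    "FP16": 16.0,
    "Q4_K_M": 4.85,
    "AWQ (4-bit, group size 128)": 4.25,
}

estimates = {
    name: weight_memory_gb(N_PARAMS, bits) for name, bits in ASSUMED_BITS.items()
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB for weights only")
```

These figures cover the weights only. The KV cache, activations, and runtime overhead add to actual RAM or VRAM usage, so measured numbers will be higher.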