A practical benchmark comparing MiniMax2.5, Llama 3 (8B and 70B), Mistral Large 2 (123B), and Gemma 2 (9B and 27B) running locally via Ollama at Q4_K_M quantization on two GPU tiers (RTX 4090 and RTX 3060). Tests cover coding accuracy (Python and JavaScript Pass@1), reasoning, creative/chat quality, inference speed (tokens per second), and VRAM consumption. Key findings: MiniMax2.5 leads on JavaScript coding; Llama 3 70B tops Python accuracy and reasoning; Mistral Large 2 wins on chat quality but is slowest and most VRAM-hungry; Gemma 2 9B and Llama 3 8B are the only viable options for interactive use on 12 GB VRAM. Reproducible Python and Node.js benchmark scripts are provided, along with hardware-specific recommendations and a full comparison table.

23m read time. From sitepoint.com
Table of contents
LLM Benchmarks 2026 Comparison
Why Benchmark Local Models Yourself?
Methodology: How We Tested
The Contenders: Model Profiles
Results: Coding Performance
Results: Inference Speed
Results: Memory and VRAM Usage
Results: Reasoning and Creative Tasks
The Verdict: Choosing the Right Model
How to Run These Benchmarks Yourself
