A comparison of LLM architectures released in 2024-2025, examining key design choices in models such as DeepSeek-V3, Llama 4, and Gemma 3. The analysis covers Multi-Head Latent Attention (MLA), which compresses keys and values to shrink the KV cache; Mixture-of-Experts (MoE) layers, which increase total parameter count while activating only a subset of experts per token; sliding window attention, which limits each token's attention to a local window to reduce memory use; and various normalization strategies. Seven years after the original GPT, most models retain a similar foundational structure while incorporating incremental but significant optimizations for performance and efficiency.

27m read time
From sebastianraschka.com
Table of contents
1. DeepSeek V3/R1
2. OLMo 2
3. Gemma 3
4. Mistral Small 3.1
5. Llama 4
6. Qwen3
7. SmolLM3
8. Kimi 2
