Google's Gemma 4 family consists of four models: a 31B dense model, a 26B-A4B Mixture of Experts, and two edge models (E4B and E2B), all released under Apache 2.0 and natively multimodal. The 26B-A4B MoE stands out by activating only 3.8B parameters per token, delivering near-31B quality at much faster inference speeds. The edge models add native audio input and function calling, enabling fully offline voice agents on mobile. A notable caveat: Google withheld the Multi-Token Prediction heads from the public weights, which limits inference speed on the 31B. Two workarounds exist: community-trained EAGLE3 draft heads, and traditional speculative decoding that uses smaller Gemma 4 models as drafts. For home lab local inference setups, the Gemma 4 family is presented as the most well-rounded option currently available.
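To make the speculative decoding workaround concrete, here is a minimal sketch of the greedy draft-then-verify loop. It does not load any Gemma 4 weights or use any inference engine's API: `target_model` and `draft_model` are hypothetical toy functions standing in for a large model and a smaller draft, and the acceptance rule shown is the simple greedy variant rather than the sampling-based one real engines may use.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposal in order. A real engine would score
    #    all k positions in one batched forward pass; this loop is sequential
    #    only to keep the sketch readable.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: Model, draft: Model, prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Generate max_new tokens using repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + max_new]


def greedy(model: Model, prompt: List[int], max_new: int) -> List[int]:
    """Plain one-token-at-a-time greedy decoding, used as the reference."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


# Hypothetical toy models (assumptions for illustration, not Gemma 4).
def target_model(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_model(ctx: List[int]) -> int:
    # Agrees with the target except when the context length is a multiple of 5.
    t = target_model(ctx)
    return t if len(ctx) % 5 else (t + 1) % 50


if __name__ == "__main__":
    prompt = [1, 2, 3]
    print(generate(target_model, draft_model, prompt, max_new=20))
    print(greedy(target_model, prompt, max_new=20))
```

The point of the design is that the output matches what the target model would have produced on its own, since every emitted token is either confirmed or supplied by the target. Any speedup comes only from how often the draft's guesses are accepted.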
Table of contents
Four models from the same DNA
The 26B MoE is the one I keep using
Tool calling that's baked into the architecture
Speculative decoding gives the 31B a speed boost