Google's Gemma 4 is a new family of open-source multimodal models (Apache 2.0 license) supporting image, text, audio, and video inputs. It comes in four sizes (E2B, E4B, 26B MoE, and 31B dense) with 128k context windows. Key architectural innovations include Per-Layer Embeddings (PLE), a shared KV cache, alternating local/global attention, and a variable-aspect-ratio vision encoder. The 31B dense model achieves an estimated LMArena score of 1452. Day-0 support is available across transformers, llama.cpp, MLX, transformers.js (WebGPU), and mistral.rs (Rust). Fine-tuning is supported via TRL, Unsloth Studio, and Vertex AI. The smaller models (E2B and E4B) also support audio input and can process video together with its audio track.

25 min read · From huggingface.co
Table of Contents

- Overview of Capabilities and Architecture
- Multimodal Capabilities
- transformers
- Llama.cpp
- Plug in your local agent
- transformers.js
- MLX
- Fine-tuning for all
- Fine-tuning with TRL
- Fine-tuning with Unsloth Studio
- Try Gemma 4
- Benchmark Results
- Acknowledgements