Google's Gemma 4 is a new family of open-source multimodal models (Apache 2.0 license) supporting image, text, audio, and video inputs. It comes in four sizes (E2B, E4B, 26B MoE, and 31B dense) with 128k context windows. Key architectural innovations include Per-Layer Embeddings (PLE), a shared KV cache, alternating local/global attention, and a variable-aspect-ratio vision encoder. The 31B dense model achieves an estimated LMArena score of 1452. Day-0 support is available across transformers, llama.cpp, MLX, transformers.js (WebGPU), and mistral.rs (Rust). Fine-tuning is supported via TRL, Unsloth Studio, and Vertex AI. The smaller models (E2B and E4B) also support audio input and can process video together with its audio track.

25 min read · From huggingface.co
Table of Contents

- Overview of Capabilities and Architecture
- Multimodal Capabilities
- transformers
- Llama.cpp
- Plug in your local agent
- transformers.js
- MLX
- Fine-tuning for all
- Fine-tuning with TRL
- Fine-tuning with Unsloth Studio
- Try Gemma 4
- Benchmark Results
- Acknowledgements