Gemma 4, Google DeepMind's latest open model family, is available for immediate deployment via vLLM and Red Hat AI Inference Server. The family spans four models (2B to 31B parameters), all supporting multimodal input (text, image, video), with the two smallest also handling audio. The 26B A4B model uses a Mixture-of-Experts architecture, activating only 3.8B parameters per forward pass for efficient inference. All models support thinking mode, native function calling, long context windows (128K–256K tokens), and 140+ languages under Apache 2.0. The guide provides step-by-step instructions for deploying the 26B A4B model using Podman and Red Hat AI Inference Server, including examples for chat, reasoning, multimodal, and function calling via the OpenAI-compatible API.

9m read timeFrom developers.redhat.com
Post cover image
Table of contents
What's new in Gemma 4The power of open: Use Gemma 4 on Day 0Get started using Red Hat AI Inference ServerExplore more

Sort: