Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform

NVIDIA Groq 3 LPX is a rack-scale inference accelerator built around 256 NVIDIA Groq 3 LPU chips, designed to complement the Vera Rubin NVL72 GPU platform for low-latency, agentic AI workloads. The LPU architecture features deterministic compiler-orchestrated execution, 500 MB on-chip SRAM per chip, 150 TB/s on-chip memory bandwidth, and 96 high-speed C2C links per chip for predictable inter-chip communication. At rack scale, LPX delivers 315 PFLOPS, 128 GB total SRAM, and 640 TB/s scale-up bandwidth. The system implements attention-FFN disaggregation (AFD): Rubin GPUs handle prefill and decode attention over the KV cache, while LPX accelerates latency-sensitive FFN/MoE execution. NVIDIA Dynamo orchestrates this heterogeneous serving pipeline. LPX also serves as a fast draft-token generator for speculative decoding. Together, Vera Rubin NVL72 and LPX deliver up to 35x higher inference throughput per megawatt at 400 tokens/sec/user versus GB200 NVL72, and up to 10x more revenue opportunity for trillion-parameter models targeting premium interactive AI services.
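The per-chip and rack-level figures quoted above can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes decimal units (1 GB = 1000 MB) and an even split of rack-level totals across all 256 chips. The per-chip compute, per-chip scale-up bandwidth, and aggregate on-chip bandwidth it derives are back-of-the-envelope values, not figures stated in the source. The function and constant names are hypothetical.

```python
# Per-chip figures quoted in the summary.
CHIPS_PER_RACK = 256
SRAM_PER_CHIP_MB = 500
ONCHIP_BW_PER_CHIP_TBPS = 150

# Rack-level figures quoted in the summary.
RACK_PFLOPS = 315
RACK_SCALEUP_BW_TBPS = 640


def rack_summary():
    """Derive rack totals and per-chip shares from the quoted figures.

    Assumes decimal units and an even division across chips; the derived
    values are estimates, not published specifications.
    """
    return {
        # 256 chips x 500 MB, expressed in decimal GB
        "total_sram_gb": CHIPS_PER_RACK * SRAM_PER_CHIP_MB / 1000,
        # Sum of on-chip SRAM bandwidth over all chips (derived)
        "aggregate_onchip_bw_tbps": CHIPS_PER_RACK * ONCHIP_BW_PER_CHIP_TBPS,
        # Rack compute divided evenly per chip (derived)
        "pflops_per_chip": RACK_PFLOPS / CHIPS_PER_RACK,
        # Rack scale-up bandwidth divided evenly per chip (derived)
        "scaleup_bw_per_chip_tbps": RACK_SCALEUP_BW_TBPS / CHIPS_PER_RACK,
    }


if __name__ == "__main__":
    for name, value in rack_summary().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, the 128 GB total SRAM figure follows directly from 256 chips at 500 MB each, and the rack-level totals work out to roughly 1.23 PFLOPS and 2.5 TB/s of scale-up bandwidth per chip.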

#llm #nvidia #agentic-ai #ai-inference

Mar 16 • 17m read time • From developer.nvidia.com

Table of contents

Introducing NVIDIA Groq 3 LPX
The shift toward interactive inference
Vera Rubin NVL72 meets LPX
Unlocking intelligent agentic swarms
What NVIDIA Groq 3 LPX enables for Developers
Learn more
