NVIDIA Groq 3 LPX is a rack-scale inference accelerator built around 256 NVIDIA Groq 3 LPU chips, designed to complement the Vera Rubin NVL72 GPU platform for low-latency, agentic AI workloads. The LPU architecture features deterministic compiler-orchestrated execution, 500 MB on-chip SRAM per chip, 150 TB/s on-chip memory bandwidth, and 96 high-speed C2C links per chip for predictable inter-chip communication. At rack scale, LPX delivers 315 PFLOPS, 128 GB total SRAM, and 640 TB/s scale-up bandwidth. The system implements attention-FFN disaggregation (AFD): Rubin GPUs handle prefill and decode attention over the KV cache, while LPX accelerates latency-sensitive FFN/MoE execution. NVIDIA Dynamo orchestrates this heterogeneous serving pipeline. LPX also serves as a fast draft-token generator for speculative decoding. Together, Vera Rubin NVL72 and LPX deliver up to 35x higher inference throughput per megawatt at 400 tokens/sec/user versus GB200 NVL72, and up to 10x more revenue opportunity for trillion-parameter models targeting premium interactive AI services.
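
The rack-scale SRAM figure follows directly from the per-chip numbers quoted above, and a short calculation makes that relationship explicit. The sketch below is only a consistency check on the summary's own figures. The function names are illustrative, decimal units (1 GB = 1000 MB) are assumed, and the per-chip bandwidth share assumes the 640 TB/s is divided evenly across chips, which the source does not state.

```python
# Figures quoted in the summary above.
CHIPS_PER_RACK = 256
SRAM_PER_CHIP_MB = 500
RACK_SCALE_UP_TBPS = 640


def rack_sram_gb(chips: int, sram_mb: int) -> float:
    """Total on-chip SRAM in decimal gigabytes (assumes 1 GB = 1000 MB)."""
    return chips * sram_mb / 1000


def per_chip_scale_up_tbps(rack_tbps: float, chips: int) -> float:
    """Per-chip share of scale-up bandwidth.

    Assumption (not from the source): the rack figure is split evenly.
    """
    return rack_tbps / chips


total_sram_gb = rack_sram_gb(CHIPS_PER_RACK, SRAM_PER_CHIP_MB)
share_tbps = per_chip_scale_up_tbps(RACK_SCALE_UP_TBPS, CHIPS_PER_RACK)

print(f"Rack SRAM: {total_sram_gb:.0f} GB")
print(f"Per-chip share of scale-up bandwidth: {share_tbps} TB/s")
```

Under these assumptions, 256 chips at 500 MB each give the 128 GB total stated in the summary. The per-chip bandwidth share is a derived number, not a published specification.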

17m read time, from developer.nvidia.com
Table of contents
Introducing NVIDIA Groq 3 LPX
The shift toward interactive inference
Vera Rubin NVL72 meets LPX
Unlocking intelligent agentic swarms
What NVIDIA Groq 3 LPX enables for Developers
Learn more
