Nvidia revealed at GTC that it will integrate Groq's language processing units (LPUs) into new LPX rack systems that sit alongside its Vera Rubin GPU racks to dramatically accelerate AI inference. The architecture splits LLM inference into two stages: Rubin GPUs handle the compute-heavy prefill phase, while 256 Groq 3 LPUs per rack handle the bandwidth-heavy decode phase, generating thousands of tokens per second per user. This enables pricing as high as $45 per million tokens. Each Groq 3 LPU offers 150 TB/s of memory bandwidth but only 500 MB of on-chip SRAM, so multiple LPX racks must be ganged together to serve trillion-parameter models. The move effectively abandons Nvidia's earlier Rubin CPX prefill processor concept. AWS is pursuing a similar hybrid approach, pairing Trainium 3 with Cerebras WSE-3 ASICs. LPUs do not yet have native CUDA support.
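A rough back-of-the-envelope sketch shows why 500 MB of SRAM per LPU pushes trillion-parameter models across several racks. The per-rack LPU count and per-chip SRAM come from the summary above; everything else is an assumption for illustration: the function name `racks_needed` is hypothetical, 1 MB is taken as 10^6 bytes, only model weights are counted (no KV cache, activations, or replication), and the weight precisions are guesses rather than anything Nvidia or Groq has stated.

```python
import math

# Figures quoted in the summary.
LPUS_PER_RACK = 256
SRAM_PER_LPU_MB = 500

# Assumption: 1 MB = 10**6 bytes.
RACK_SRAM_BYTES = LPUS_PER_RACK * SRAM_PER_LPU_MB * 10**6


def racks_needed(params: int, bytes_per_param: float) -> int:
    """Minimum LPX racks whose combined SRAM can hold the weights alone.

    Hypothetical helper: ignores KV cache, activations, and any
    duplication of weights across chips, so real deployments may need more.
    """
    weight_bytes = params * bytes_per_param
    return math.ceil(weight_bytes / RACK_SRAM_BYTES)


if __name__ == "__main__":
    print(f"SRAM per rack: {RACK_SRAM_BYTES / 10**9:.0f} GB")
    one_trillion = 10**12
    # Assumed weight precisions: 16-bit, 8-bit, and 4-bit.
    for label, width in [("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)]:
        print(f"{label} weights: {racks_needed(one_trillion, width)} racks")
```

Under these assumptions one rack holds 128 GB of SRAM, so even a 4-bit trillion-parameter model would not fit in a single rack.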

5m read time. From go.theregister.com