TPUs achieve high throughput and energy efficiency through systolic arrays and ahead-of-time compilation with XLA. Unlike GPUs with thousands of cores and large HBM, TPUs use fewer compute units with larger on-chip memory buffers. They scale from single chips to multi-pod configurations using 3D torus topologies connected via Inter-Core Interconnect and Optical Circuit Switching. The design philosophy prioritizes predictable computation patterns and minimal memory operations, making them ideal for matrix multiplication-heavy workloads like neural network training and inference.

19m read timeFrom henryhmko.github.io
Post cover image
Table of contents
BackgroundTPU Design choice #1: Systolic Arrays + PipeliningTPU Design choice #2: Ahead of Time (AoT) Compilation + Less Reliance on CachesTray Level (a.k.a "Board"; 4 chips)Rack Level (4x4x4 chips)Full Pod Level (aka "Superpod"; 4096 chips for TPUv4)Multi-Pod Level (a.k.a "Multislice"; 4096+ chips for TPUv4)Putting diagrams to perspective in real-lifeAcknowledgementsReferences

Sort: