Google describes how it has evolved its network infrastructure to meet the demands of large-scale AI workloads. The post covers three pillars: the fabric inside the AI Hypercomputer, including the new Virgo Network scale-out fabric that links up to 134,000 TPU 8t chips with 47 petabits/sec of bandwidth; the fabric across campuses over the WAN, where traffic has grown 10x since 2020 and a new AI-native Cloud Interconnect scales to petabit capacity; and the global network for AI inference, which spans more than 10 million km of fiber, 43 cloud regions, and 200+ edge locations. Key innovations include a decoupled three-domain campus architecture, autonomous fault and hang detection for training jobs, sub-millisecond telemetry for microburst detection, and multi-shard isolation of the global network for beyond-nines reliability.
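
To put the Virgo Network figures in perspective, here is a back-of-the-envelope sketch that divides the quoted 47 petabits/sec across the quoted 134,000 chips. It assumes the figure is an aggregate shared evenly by every chip, which the summary does not state; the function name `per_chip_gbps` is hypothetical and the result is a rough average, not a published per-chip specification.

```python
def per_chip_gbps(total_petabits_per_sec: float, chip_count: int) -> float:
    """Naive even split of aggregate fabric bandwidth, in Gb/s per chip.

    Assumption: the aggregate figure is shared equally by all chips.
    """
    if chip_count <= 0:
        raise ValueError("chip_count must be positive")
    total_gbps = total_petabits_per_sec * 1_000_000  # 1 Pb/s = 10^6 Gb/s
    return total_gbps / chip_count


# Figures quoted in the summary; the even split itself is an assumption.
VIRGO_PETABITS_PER_SEC = 47
VIRGO_MAX_CHIPS = 134_000

if __name__ == "__main__":
    share = per_chip_gbps(VIRGO_PETABITS_PER_SEC, VIRGO_MAX_CHIPS)
    print(f"{share:.1f} Gb/s per chip (naive average)")
```

Under that assumption the average works out to roughly 350 Gb/s per chip. How the bandwidth is actually measured and allocated depends on the fabric topology, which the summary does not describe.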