TPU 8t and TPU 8i technical deep dive

Google Cloud's eighth-generation TPUs introduce two specialized systems: TPU 8t for large-scale pre-training and TPU 8i for inference and reinforcement learning. TPU 8t features SparseCore for embedding workloads, native FP4 support, a new Virgo Network topology that enables up to 4x DCN bandwidth, and TPUDirect RDMA/Storage for 10x faster data access; it scales to over 1 million chips in a single training cluster. TPU 8i is optimized for high-concurrency reasoning, with 3x more on-chip SRAM, a new Collectives Acceleration Engine (CAE) that reduces collective latency by 5x, and a Boardfly ICI topology, inspired by Dragonfly, that cuts network diameter from 16 to 7 hops for a 50% latency improvement in all-to-all communication. Both chips deliver up to 2x better performance-per-watt than the previous Ironwood generation, with TPU 8t achieving 2.7x training price-performance and TPU 8i achieving an 80% improvement in inference price-performance. The systems integrate with JAX, PyTorch, vLLM, XLA, and Pathways, with native PyTorch support now in preview.
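To put the headline ratios above on a common footing, here is a minimal arithmetic sketch in Python. It assumes that "price-performance" means work done per dollar and that an "80% improvement" corresponds to a 1.8x multiplier; neither definition is spelled out in the summary, and the helper names are hypothetical. The hop figure is a simple ratio of the quoted network diameters and is separate from the quoted 50% latency improvement.

```python
def percent_to_multiplier(percent_improvement: float) -> float:
    """Convert an 'N% improvement' into a multiplier (assumption: 80% -> 1.8x)."""
    return 1.0 + percent_improvement / 100.0


def relative_cost(price_performance_gain: float) -> float:
    """Cost of a fixed workload relative to the baseline, assuming
    price-performance means work done per dollar."""
    if price_performance_gain <= 0:
        raise ValueError("gain must be positive")
    return 1.0 / price_performance_gain


# Figures quoted in the summary above.
tpu8t_cost = relative_cost(2.7)                        # training, 2.7x
tpu8i_cost = relative_cost(percent_to_multiplier(80))  # inference, 80% better
hop_reduction = (16 - 7) / 16                          # diameter 16 -> 7 hops

print(f"TPU 8t: same training work at {tpu8t_cost:.0%} of baseline cost")
print(f"TPU 8i: same inference work at {tpu8i_cost:.0%} of baseline cost")
print(f"Boardfly: {hop_reduction:.1%} fewer worst-case hops")
```

Under these assumptions, a 2.7x gain means the same training job costs roughly a third of the baseline, while an 80% gain means the same inference load costs a little over half.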

#gcp

Apr 22 • 11m read time • From cloud.google.com

Table of contents

TPU 8: Specialized by design
TPU 8t: The pre-training powerhouse
