Google's TorchTPU is a new engineering stack that lets PyTorch run natively on TPU hardware with minimal code changes. Built on an 'Eager First' philosophy and PyTorch's PrivateUse1 interface, it offers three execution modes: Debug Eager (synchronous, for debugging), Strict Eager (asynchronous, mirroring standard PyTorch), and Fused Eager (automatically fuses operations for performance gains of 50–100% or more). For peak performance, it integrates with torch.compile, using XLA as the backend compiler and translating PyTorch operators to StableHLO IR. TorchTPU supports DDP, FSDPv2, and DTensor for distributed workloads, and adds MPMD support to handle divergent execution across ranks, addressing a key limitation of its predecessor, PyTorch/XLA. Custom kernels via Pallas and JAX are supported. The 2026 roadmap includes bounded dynamism for dynamic shapes, precompiled kernel libraries, and broader ecosystem integration.
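To make the idea behind Fused Eager concrete, here is a toy, pure-Python sketch of operation fusion. It is not TorchTPU code and uses none of its APIs: the classes `EagerVector` and `FusedVector` and the `stats` counter are hypothetical names invented for illustration. The assumption is only that fusion means combining several elementwise operations into a single pass over the data instead of one pass per operation.

```python
# Toy illustration of operation fusion (hypothetical classes, not TorchTPU APIs).

class EagerVector:
    """Runs each op immediately: one full pass over the data per op."""

    def __init__(self, data, stats):
        self.data = list(data)
        self.stats = stats  # shared dict counting passes over the data

    def map(self, fn):
        self.stats["passes"] += 1
        return EagerVector([fn(x) for x in self.data], self.stats)


class FusedVector:
    """Queues elementwise ops and runs them in a single pass when read."""

    def __init__(self, data, stats, pending=()):
        self._data = list(data)
        self.stats = stats
        self._pending = tuple(pending)  # ops recorded but not yet executed

    def map(self, fn):
        # Record the op instead of executing it.
        return FusedVector(self._data, self.stats, self._pending + (fn,))

    def materialize(self):
        if self._pending:
            self.stats["passes"] += 1  # all queued ops share one pass
            out = []
            for x in self._data:
                for fn in self._pending:
                    x = fn(x)
                out.append(x)
            self._data = out
            self._pending = ()
        return self._data


def scale(x):
    return x * 2.0


def shift(x):
    return x + 1.0


def relu(x):
    return max(x, 0.0)


values = [-2.0, 0.5, 3.0]

eager_stats = {"passes": 0}
eager_out = EagerVector(values, eager_stats).map(scale).map(shift).map(relu).data

fused_stats = {"passes": 0}
fused_out = FusedVector(values, fused_stats).map(scale).map(shift).map(relu).materialize()

print(eager_out, eager_stats["passes"])
print(fused_out, fused_stats["passes"])
```

Both variants compute the same result; the fused one touches the data once rather than three times. On real accelerators the benefit of fusion comes largely from fewer kernel launches and fewer round trips to memory, which this toy only approximates by counting passes.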
Table of contents
Architecting for Usability, Portability, and Performance
Engineering the TorchTPU Stack: The Technical Reality
The Road Ahead: 2026 and Beyond