TorchTPU: Running PyTorch Natively on TPUs at Google Scale

TorchTPU is a new engineering stack from Google that lets PyTorch run natively on TPU hardware with minimal code changes. It follows an 'Eager First' philosophy, is built on PyTorch's PrivateUse1 interface, and offers three execution modes: Debug Eager (synchronous, for debugging), Strict Eager (asynchronous, mirroring standard PyTorch behavior), and Fused Eager (automatically fuses operations for performance gains of 50–100% or more). For peak performance, it integrates with torch.compile, using XLA as the backend compiler and translating PyTorch operators to StableHLO IR. For distributed workloads, TorchTPU supports DDP, FSDPv2, and DTensor, and it adds MPMD support to handle divergent execution across ranks, addressing a key limitation of its predecessor, PyTorch/XLA. Custom kernels written with Pallas and JAX are supported. The 2026 roadmap includes bounded dynamism for dynamic shapes, precompiled kernel libraries, and broader ecosystem integration.
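To make the idea behind operation fusion concrete, here is a toy sketch in plain Python. It is not TorchTPU code and does not reflect how TorchTPU implements Fused Eager; the names `run_eager`, `fuse`, and `run_fused` are hypothetical, and counting "passes" over the data is only an assumed stand-in for separate kernel launches. It shows that a chain of elementwise operations gives the same result whether each one runs as its own pass or all are composed into a single pass.

```python
from typing import Callable, List, Tuple

# An elementwise operation on a single value (hypothetical stand-in for a tensor op).
Op = Callable[[float], float]


def run_eager(data: List[float], ops: List[Op]) -> Tuple[List[float], int]:
    """Apply each op as its own full pass over the data, one pass per op."""
    passes = 0
    for op in ops:
        data = [op(x) for x in data]  # materializes an intermediate result
        passes += 1
    return data, passes


def fuse(ops: List[Op]) -> Op:
    """Compose a chain of elementwise ops into a single function."""
    def fused(x: float) -> float:
        for op in ops:
            x = op(x)
        return x
    return fused


def run_fused(data: List[float], ops: List[Op]) -> Tuple[List[float], int]:
    """Apply the composed op in a single pass, with no intermediate lists."""
    fused = fuse(ops)
    return [fused(x) for x in data], 1


# Example chain: scale, shift, then clamp at zero (a ReLU-like step).
ops: List[Op] = [lambda x: x * 2.0, lambda x: x + 1.0, lambda x: max(x, 0.0)]
data = [-3.0, 0.5, 4.0]

eager_out, eager_passes = run_eager(data, ops)
fused_out, fused_passes = run_fused(data, ops)
print(eager_out, eager_passes)
print(fused_out, fused_passes)
```

In this toy model the fused path touches the data once instead of three times. On real accelerators the benefit of fusion typically comes from fewer kernel launches and less memory traffic for intermediates, which this sketch only imitates with a pass counter.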

#machine-learning

#pytorch

Apr 07 • 7m read time • From developers.googleblog.com

Table of contents

Architecting for Usability, Portability, and Performance

Engineering the TorchTPU Stack: The Technical Reality

The Road Ahead: 2026 and Beyond
