TorchTPU: Running PyTorch Natively on TPUs at Google Scale

Google has introduced TorchTPU, an engineering stack that enables PyTorch to run natively and efficiently on TPU hardware. Built on an 'Eager First' philosophy, it integrates with PyTorch's PrivateUse1 interface to provide familiar tensor semantics without subclasses or wrappers. Three eager execution modes are offered: Debug Eager for debugging, Strict Eager for async execution mirroring standard PyTorch behavior, and Fused Eager which automatically fuses operations on-the-fly for 50–100%+ performance gains. For peak performance, torch.compile routes through XLA as the backend compiler, translating PyTorch operators to StableHLO IR. TorchTPU supports DDP, FSDPv2, and DTensor for distributed workloads, and unlike its predecessor PyTorch/XLA, it handles MPMD (divergent multi-rank execution) gracefully. Custom kernels via Pallas and JAX are supported, with Helion kernel support in progress. The 2026 roadmap includes bounded dynamism for dynamic shapes, precompiled kernel libraries, and further ecosystem integrations.
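The idea behind Fused Eager, combining several queued operations into one pass instead of running each separately, can be illustrated with a toy in plain Python. This is only a conceptual sketch: `FusedPipeline`, `run_unfused`, and `run_fused` are hypothetical names invented for illustration, and nothing here reflects TorchTPU's actual fusion mechanism or API.

```python
from typing import Callable, List, Tuple

ElementwiseOp = Callable[[float], float]


class FusedPipeline:
    """Toy model of fusion: queue elementwise ops, then run them over the data."""

    def __init__(self) -> None:
        self.ops: List[ElementwiseOp] = []

    def add(self, op: ElementwiseOp) -> "FusedPipeline":
        # Record the op instead of executing it immediately.
        self.ops.append(op)
        return self

    def run_unfused(self, data: List[float]) -> Tuple[List[float], int]:
        # One full pass over the data (and one intermediate list) per op.
        passes = 0
        for op in self.ops:
            data = [op(x) for x in data]
            passes += 1
        return data, passes

    def run_fused(self, data: List[float]) -> Tuple[List[float], int]:
        # A single pass: all queued ops are applied to each element in turn,
        # so no intermediate lists are materialized.
        out = []
        for x in data:
            for op in self.ops:
                x = op(x)
            out.append(x)
        return out, 1


pipeline = (
    FusedPipeline()
    .add(lambda x: x * 2.0)      # scale
    .add(lambda x: x + 1.0)      # shift
    .add(lambda x: max(x, 0.0))  # ReLU-like clamp
)

values = [-3.0, 0.5, 2.0]
unfused_result, unfused_passes = pipeline.run_unfused(values)
fused_result, fused_passes = pipeline.run_fused(values)

print(unfused_result, unfused_passes)
print(fused_result, fused_passes)
```

Both paths compute the same values; the fused path simply touches the data once rather than once per operation, which is the general reason fusion tends to reduce memory traffic on accelerators.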

#machine-learning

#pytorch

Apr 24 • 7m read time • From developers.googleblog.com

Table of contents

Architecting for Usability, Portability, and Performance
Engineering the TorchTPU Stack: The Technical Reality
The Road Ahead: 2026 and Beyond
