Google has introduced TorchTPU, an engineering stack that enables PyTorch to run natively and efficiently on TPU hardware. Built on an 'Eager First' philosophy, it integrates with PyTorch's PrivateUse1 interface to provide familiar tensor semantics without subclasses or wrappers. Three eager execution modes are offered: Debug Eager for debugging, Strict Eager for asynchronous execution that mirrors standard PyTorch behavior, and Fused Eager, which automatically fuses operations on the fly for performance gains of 50–100% or more. For peak performance, torch.compile routes through XLA as the backend compiler, translating PyTorch operators to StableHLO IR. TorchTPU supports DDP, FSDPv2, and DTensor for distributed workloads, and unlike its predecessor PyTorch/XLA, it handles MPMD (divergent multi-rank execution) gracefully. Custom kernels via Pallas and JAX are supported, with Helion kernel support in progress. The 2026 roadmap includes bounded dynamism for dynamic shapes, precompiled kernel libraries, and further ecosystem integrations.
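
To make the difference between running each operation immediately and fusing operations on the fly more concrete, here is a toy sketch in plain Python. It is not TorchTPU code and does not reflect how TorchTPU is implemented: the `StrictEager` and `FusedEager` classes and the `scale`, `shift`, and `relu` helpers are hypothetical names invented for this illustration, and the only thing it measures is how many passes over the data each strategy makes.

```python
from typing import Callable, List

Op = Callable[[float], float]


class StrictEager:
    """Toy model: every op runs immediately, one full pass over the data per op."""

    def __init__(self, data: List[float]) -> None:
        self.data = list(data)
        self.passes = 0

    def apply(self, op: Op) -> "StrictEager":
        self.data = [op(x) for x in self.data]  # materializes an intermediate list
        self.passes += 1
        return self

    def result(self) -> List[float]:
        return self.data


class FusedEager:
    """Toy model: ops are queued, then run as a single fused pass on demand."""

    def __init__(self, data: List[float]) -> None:
        self.data = list(data)
        self.pending: List[Op] = []
        self.passes = 0

    def apply(self, op: Op) -> "FusedEager":
        self.pending.append(op)  # record the op, do no work yet
        return self

    def result(self) -> List[float]:
        if self.pending:
            ops, self.pending = self.pending, []

            def fused(x: float) -> float:
                # Chain all queued ops per element, with no intermediate lists.
                for op in ops:
                    x = op(x)
                return x

            self.data = [fused(x) for x in self.data]
            self.passes += 1
        return self.data


def scale(x: float) -> float:
    return x * 2.0


def shift(x: float) -> float:
    return x + 1.0


def relu(x: float) -> float:
    return max(x, 0.0)


values = [-2.0, 0.5, 3.0]
strict = StrictEager(values).apply(scale).apply(shift).apply(relu)
fused = FusedEager(values).apply(scale).apply(shift).apply(relu)

print(strict.result(), strict.passes)  # [0.0, 2.0, 7.0] 3
print(fused.result(), fused.passes)    # [0.0, 2.0, 7.0] 1
```

Both strategies produce the same values, but the fused version touches the data once instead of three times. On real accelerators the benefit of fusion comes from fewer kernel launches and less memory traffic, which this toy only approximates by counting passes.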

7m read time. From developers.googleblog.com
Table of contents
Architecting for Usability, Portability, and Performance
Engineering the TorchTPU Stack: The Technical Reality
The Road Ahead: 2026 and Beyond