Pinterest's ML platform team shares a detailed account of upgrading PyTorch from 2.1 to 2.6 in production with zero downtime. The post covers the full cross-stack effort: migrating GPU hosts from Ubuntu 20 to the Ubuntu 24 DLAMI with CUDA 12.6 and NVIDIA driver 570; bridging breaking LibTorch C++ API changes with a compile-time version macro; resolving TorchScript deadlocks and hangs by disabling JIT profiling mode and NVFuser; handling the Caffe2 deprecation with a pinned legacy Docker image; and executing a phased, multi-surface rollout. Post-upgrade production issues included DCGM metric loss caused by nv-hostengine resource contention and intermittent CUDA failures traced to a systemd cgroup driver conflict in the NVIDIA Container Toolkit, both resolved with targeted configuration fixes.
Table of contents
Introduction
Challenges
Journey to PyTorch 2.6
Bridging Breaking APIs and Deprecated Caffe2
Production Aftercare
Uncovering a Cgroup Driver Gotcha
Wrap Up
Acknowledgement