Pinterest's ML platform team shares a detailed account of upgrading PyTorch from 2.1 to 2.6 in production with zero downtime. The post covers the full cross-stack effort: migrating GPU hosts from Ubuntu 20 to the Ubuntu 24 DLAMI with CUDA 12.6 and NVIDIA driver 570; bridging breaking LibTorch C++ API changes with a compile-time version macro; resolving TorchScript deadlocks and hangs by disabling JIT profiling mode and NVFuser; handling the Caffe2 deprecation with a pinned legacy Docker image; and executing a phased, multi-surface rollout. Post-upgrade production issues included DCGM metric loss caused by nv-hostengine resource contention and intermittent CUDA failures traced to a systemd cgroup driver conflict in the NVIDIA Container Toolkit, both resolved with targeted configuration fixes.
Table of contents
Introduction
Challenges
Journey to PyTorch 2.6
Bridging Breaking APIs and Deprecated Caffe2
Production Aftercare
Uncovering a Cgroup Driver Gotcha
Wrap Up
Acknowledgement