LLM training checkpoints are massive (782 GB for a 70B model) and frequent, making them a significant cost driver. Synchronous checkpoint writes leave every GPU idle during saves; for 128 B200s training a 405B model, that idle time costs over $200,000 per month. NVIDIA nvCOMP, a GPU-accelerated lossless compression library, can reduce checkpoint size.
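The monthly-cost figure above comes from multiplying the fraction of time GPUs sit idle during synchronous saves by the cluster's total GPU-hour spend. A minimal back-of-envelope sketch, using the article's cluster size of 128 GPUs but otherwise assumed numbers (the hourly rate, checkpoint interval, and stall duration below are illustrative, not figures from the article):

```python
# Back-of-envelope cost of synchronous checkpoint stalls.
# GPUS comes from the article; every other constant is an assumption.

GPUS = 128                    # B200 cluster size from the article
GPU_HOURLY_RATE = 10.0        # assumed $/GPU-hour for a B200-class GPU
TRAIN_MIN_PER_CYCLE = 30.0    # assumed minutes of useful training between saves
STALL_MIN_PER_SAVE = 5.0      # assumed minutes all GPUs idle per synchronous save
HOURS_PER_MONTH = 730.0

# Fraction of each train-then-save cycle spent stalled.
idle_fraction = STALL_MIN_PER_SAVE / (TRAIN_MIN_PER_CYCLE + STALL_MIN_PER_SAVE)

# Total GPU-hours billed per month, and the share wasted on stalls.
monthly_gpu_hours = GPUS * HOURS_PER_MONTH
monthly_stall_cost = monthly_gpu_hours * idle_fraction * GPU_HOURLY_RATE

print(f"Idle fraction:      {idle_fraction:.1%}")
print(f"Monthly stall cost: ${monthly_stall_cost:,.0f}")
```

Shrinking the checkpoint (via compression) or overlapping the write with training shortens the stall, which reduces `idle_fraction` and the cost in direct proportion.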

From developer.nvidia.com (12-minute read)
Table of contents:
- Inside a single checkpoint
- NVIDIA nvCOMP introduces GPU-accelerated compression
- The math: How nvCOMP saves money
- Integration: ~30 Lines of Python
- Get started
