Just-in-time (JIT) checkpointing with the Kubeflow Training SDK on Red Hat OpenShift AI 3.4 EA2 lets distributed LLM training jobs save their state the moment a termination signal arrives, so that pod preemption or node failure does not waste GPU time and money. The guide covers configuring periodic checkpointing, JIT checkpointing, and automatic resume with two storage backends: PVC (simple, single-cluster) and S3 (portable, scalable). It explains key SDK parameters such as enable_jit_checkpoint, PeriodicCheckpointConfig, and data_connection_name with code examples. Best practices include maximizing GPUs per node, avoiding overly frequent checkpoints, planning for S3 storage spikes during DeepSpeed ZeRO-3 consolidation, and managing S3 retention with lifecycle policies. A complete Jupyter notebook example is available in the Red Hat AI examples repository.

9m read time
From developers.redhat.com
Table of contents
Prerequisites
Set up your environment
The training function
Two storage backends: PVC and S3
How S3 checkpointing works
JIT checkpointing in action
Checkpointing best practices
Learn more
