Just-in-time (JIT) checkpointing with the Kubeflow Training SDK on Red Hat OpenShift AI 3.4 EA2 lets distributed LLM training jobs save their state the moment a termination signal arrives, so that pod preemption or node failure does not waste GPU time and money. The guide covers configuring periodic checkpointing, JIT checkpointing, and automatic resume using two storage backends: PVC (simple, single-cluster) and S3 (portable, scalable). Key SDK parameters such as enable_jit_checkpoint, PeriodicCheckpointConfig, and data_connection_name are explained with code examples. Best practices include maximizing GPUs per node, avoiding overly frequent checkpoints, planning for S3 storage spikes during DeepSpeed ZeRO-3 consolidation, and managing S3 retention via lifecycle policies. A complete Jupyter notebook example is available in the Red Hat AI examples repository.
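
To make the idea of saving state when a termination signal arrives concrete, here is a minimal, framework-free sketch in plain Python. It is not the Kubeflow Training SDK's implementation and does not use its API: JitCheckpointer, train, and kill_at are hypothetical names invented for illustration, a local directory stands in for a PVC mount, and a small JSON file stands in for model and optimizer state. It combines the three behaviors the guide describes: periodic saves, a save triggered by SIGTERM, and resume from the latest checkpoint.

```python
import json
import os
import signal
import tempfile


class JitCheckpointer:
    """Hypothetical helper: periodic saves plus a save when SIGTERM arrives."""

    def __init__(self, ckpt_dir, save_every):
        self.ckpt_dir = ckpt_dir
        self.save_every = save_every
        self.stop_requested = False
        os.makedirs(ckpt_dir, exist_ok=True)
        # Must run in the main thread; Python only allows handlers there.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Only set a flag here; the save itself happens at a step boundary
        # so the state written to disk is consistent.
        self.stop_requested = True

    def _path(self):
        return os.path.join(self.ckpt_dir, "latest.json")

    def save(self, state):
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a truncated checkpoint behind.
        tmp = self._path() + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self._path())

    def load(self):
        # Resume from the latest checkpoint, or start fresh if none exists.
        if not os.path.exists(self._path()):
            return {"step": 0, "loss": None}
        with open(self._path()) as f:
            return json.load(f)


def train(ckpt, total_steps, kill_at=None):
    """Toy training loop; kill_at simulates a preemption at that step."""
    state = ckpt.load()
    while state["step"] < total_steps:
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # placeholder for a real update
        if kill_at is not None and state["step"] == kill_at:
            signal.raise_signal(signal.SIGTERM)  # simulate pod termination
        if ckpt.stop_requested:
            ckpt.save(state)  # the just-in-time checkpoint
            return state["step"], "preempted"
        if state["step"] % ckpt.save_every == 0:
            ckpt.save(state)  # the periodic checkpoint
    return state["step"], "finished"


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    print(train(JitCheckpointer(workdir, save_every=5), 20, kill_at=7))
    print(train(JitCheckpointer(workdir, save_every=5), 20))
```

In this sketch the first run stops at the simulated preemption and writes its progress, and the second run picks up from that saved step rather than from the last periodic checkpoint. A real job would also have to finish the save within the pod's termination grace period, which is why the handler only sets a flag and leaves the write to the training loop.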

Table of contents
Prerequisites
Set up your environment
The training function
Two storage backends: PVC and S3
How S3 checkpointing works
JIT checkpointing in action
Checkpointing best practices
Learn more