A guide to JIT checkpointing with Kubeflow Trainer on OpenShift AI

Just-in-time (JIT) checkpointing with the Kubeflow Training SDK on Red Hat OpenShift AI 3.4 EA2 lets distributed LLM training jobs save their state the moment a termination signal arrives, so that pod preemption or node failure does not waste GPU time and money. The guide covers how to configure periodic checkpointing, JIT checkpointing, and automatic resume with two storage backends: PVC (simple, single-cluster) and S3 (portable, scalable). It explains key SDK parameters such as enable_jit_checkpoint, PeriodicCheckpointConfig, and data_connection_name with code examples. Best practices include maximizing GPUs per node, avoiding overly frequent checkpoints, planning for S3 storage spikes during DeepSpeed ZeRO-3 consolidation, and managing S3 retention with lifecycle policies. A complete Jupyter notebook example is available in the Red Hat AI examples repository.
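To make the idea concrete, here is a minimal, framework-free sketch of how periodic checkpointing, a save triggered by a termination signal, and automatic resume fit together. It is not the Kubeflow Training SDK API: the JITCheckpointer class, the train function, and the interrupt_at parameter are hypothetical names invented for this illustration, and a JSON file stands in for real model and optimizer state. According to the summary, the SDK's enable_jit_checkpoint option provides this behavior for you.

```python
import json
import os
import signal


class JITCheckpointer:
    """Hypothetical helper for illustration; not part of any SDK."""

    def __init__(self, path, every_n_steps):
        self.path = path
        self.every_n_steps = every_n_steps
        self.stop_requested = False

    def install(self):
        # Register the handler for SIGTERM, the signal Kubernetes sends
        # to a pod's containers before it kills them.
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Only set a flag here; the actual save happens at the next
        # step boundary, where the training state is consistent.
        self.stop_requested = True

    def save(self, state):
        # Write to a temporary file, then rename it into place, so a
        # crash mid-write never leaves a truncated checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)

    def load(self):
        # Resume from the last checkpoint if one exists.
        if not os.path.exists(self.path):
            return {"step": 0}
        with open(self.path) as f:
            return json.load(f)


def train(ckpt, total_steps, interrupt_at=None):
    state = ckpt.load()
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for one real training step
        if state["step"] == interrupt_at:
            # Simulate signal delivery by invoking the handler directly.
            ckpt._on_sigterm(signal.SIGTERM, None)
        if ckpt.stop_requested:
            ckpt.save(state)  # just-in-time checkpoint
            return state["step"]
        if state["step"] % ckpt.every_n_steps == 0:
            ckpt.save(state)  # periodic checkpoint
    ckpt.save(state)  # final checkpoint when training completes
    return state["step"]
```

In a real job you would call install() once at startup and let the platform deliver SIGTERM. The design point is that the handler only records the request, while the training loop performs the save at a step boundary and a restarted job picks up from whatever load() returns.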

#kubernetes

#openshift

May 21 • 9m read time • From developers.redhat.com

Table of contents

Prerequisites
Set up your environment
The training function
Two storage backends: PVC and S3
How S3 checkpointing works
JIT checkpointing in action
Checkpointing best practices
Learn more
