Continuous checkpointing in Orbax and MaxText maximizes I/O bandwidth utilization while reducing the training progress put at risk by hardware failures or preemptions. Unlike fixed-interval checkpointing, it starts a new asynchronous save as soon as the previous save completes, which reduces the mean training time lost when a failure occurs.
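The scheduling idea can be illustrated with a minimal sketch in plain Python. It does not use the Orbax or MaxText APIs: `train_with_continuous_checkpointing`, `train_step`, and `save_fn` are hypothetical names introduced here, and a single background thread stands in for the asynchronous checkpoint writer.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Optional


def train_with_continuous_checkpointing(
    num_steps: int,
    train_step: Callable[[int], None],
    save_fn: Callable[[int], None],
) -> List[int]:
    """Run `num_steps` training steps with back-to-back background saves.

    A new save is started after a step only if no save is in flight, so
    the checkpoint interval adapts to how long each save takes instead
    of following a fixed step count. Returns the steps at which a save
    was started.
    """
    started: List[int] = []
    pending: Optional[Future] = None
    # One worker: at most one checkpoint save is in flight at a time.
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step in range(1, num_steps + 1):
            train_step(step)
            if pending is None or pending.done():
                if pending is not None:
                    # Re-raise any error from the previous save.
                    pending.result()
                # Assumption: save_fn receives everything it needs to
                # write a consistent snapshot for this step; a real
                # trainer would copy the state before training resumes.
                pending = pool.submit(save_fn, step)
                started.append(step)
    # Leaving the `with` block waits for the final save to finish.
    if pending is not None:
        pending.result()
    return started
```

For example, calling `train_with_continuous_checkpointing(1000, train_step, save_fn)` with a slow `save_fn` produces few, widely spaced checkpoints, while a fast `save_fn` produces a checkpoint at nearly every step. In both cases the writer is kept busy, which is the contrast with a fixed interval that the paragraph above describes.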