JobSet is a new open source API designed to streamline the management of distributed ML training and high-performance computing (HPC) workloads on Kubernetes. It addresses limitations of the existing Kubernetes Job API by supporting multiple pod templates, job groups, inter-pod communication, and startup sequencing. JobSet models a workload as a group of Kubernetes Jobs, which improves scheduling and lifecycle management for the workload as a whole. Key features include replicated jobs, automatic headless service creation, configurable success and failure policies, and integration with Kueue for efficient capacity management.
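To make the feature list concrete, here is a minimal sketch in Python that assembles a JobSet manifest as a plain dictionary. It is illustrative only: the field names (`replicatedJobs`, `successPolicy`, `failurePolicy`, `startupPolicy`), the `jobset.x-k8s.io/v1alpha2` API version, and the `kueue.x-k8s.io/queue-name` label are written from memory of the JobSet and Kueue APIs and should be checked against the version installed in your cluster. The helper names, the image names, and the queue name are hypothetical.

```python
import json


def job_template(image, parallelism):
    """Return a minimal Job template body (hypothetical helper)."""
    return {
        "spec": {
            "parallelism": parallelism,
            "completions": parallelism,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "main", "image": image}],
                }
            },
        }
    }


def build_jobset(name, queue_name):
    """Assemble a JobSet manifest with two replicated jobs.

    Field names are assumptions about the JobSet API, not verified
    against a specific release.
    """
    return {
        "apiVersion": "jobset.x-k8s.io/v1alpha2",  # assumed API version
        "kind": "JobSet",
        "metadata": {
            "name": name,
            # Assumed Kueue label that routes the workload to a queue.
            "labels": {"kueue.x-k8s.io/queue-name": queue_name},
        },
        "spec": {
            # Startup sequencing: start replicated jobs in listed order.
            "startupPolicy": {"startupPolicyOrder": "InOrder"},
            # Success policy: the JobSet succeeds once all workers finish.
            "successPolicy": {
                "operator": "All",
                "targetReplicatedJobs": ["workers"],
            },
            # Failure policy: restart the whole group a bounded number of times.
            "failurePolicy": {"maxRestarts": 3},
            # Multiple pod templates: one per replicated job.
            "replicatedJobs": [
                {
                    "name": "driver",
                    "replicas": 1,
                    "template": job_template("example.com/driver:latest", 1),
                },
                {
                    "name": "workers",
                    "replicas": 2,
                    "template": job_template("example.com/worker:latest", 4),
                },
            ],
        },
    }


manifest = build_jobset("demo-training", "user-queue")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and reviewed before being applied to a cluster that has the JobSet controller installed. Building the manifest in code rather than by hand makes it easier to vary the number of replicas or the pod templates per replicated job.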