Amazon SageMaker HyperPod now supports gang scheduling for distributed AI/ML training workloads on EKS-orchestrated clusters. Gang scheduling ensures all pods required for a distributed job are ready before training begins, preventing wasted compute from partial job runs and deadlocks caused by resource contention. If any pod fails to become ready within a configurable time window, the workload is pulled back and automatically requeued. Administrators can tune settings via the HyperPod Console, including pod readiness wait times, node failure handling, sequential workload admission, and retry scheduling. The feature is available across multiple AWS Regions.
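
The all-or-nothing behavior described above can be illustrated with a small, self-contained simulation. This is only a sketch of the general gang-scheduling idea, not HyperPod's implementation or API: `Workload`, `all_pods_ready`, `admit`, the `observe` callback, and the `timeout_s` and `max_retries` parameters are all hypothetical names chosen for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical stand-in for a distributed training job."""
    name: str
    pods: int
    retries: int = 0


def all_pods_ready(ready_after, timeout_s):
    """True only if every pod became ready within the time window.

    ready_after holds seconds-to-ready per pod; None means the pod
    never became ready (for example, no capacity was available).
    """
    return all(t is not None and t <= timeout_s for t in ready_after)


def admit(queue, observe, timeout_s, max_retries):
    """Admit workloads one at a time (sequential admission).

    A workload starts only if its whole gang is ready in time.
    Otherwise it is pulled back and requeued until max_retries
    is exhausted. observe is an assumed callback that reports
    per-pod readiness times for a workload.
    """
    started, failed = [], []
    while queue:
        w = queue.popleft()
        if all_pods_ready(observe(w), timeout_s):
            started.append(w.name)
        elif w.retries < max_retries:
            w.retries += 1
            queue.append(w)  # pulled back and requeued
        else:
            failed.append(w.name)
    return started, failed


def observe(w):
    # Simulated cluster: job-b is missing one pod on its first attempt.
    if w.name == "job-b" and w.retries == 0:
        return [5, None, 8, 12]
    return [5] * w.pods


queue = deque([Workload("job-a", 2), Workload("job-b", 4)])
started, failed = admit(queue, observe, timeout_s=300, max_retries=2)
print(started, failed)
```

In this toy run, `job-b` never starts with only three of its four pods; it is requeued and begins only once the full gang is ready, which is the property that avoids partial runs holding resources.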

Source: aws.amazon.com