SageMaker HyperPod now supports gang scheduling for distributed training workloads

Amazon SageMaker HyperPod now supports gang scheduling for distributed AI/ML training workloads on EKS-orchestrated clusters. Gang scheduling ensures all pods required for a distributed job are ready before training begins, preventing wasted compute from partial job runs and deadlocks caused by resource contention. If not all pods become ready within a configurable time window, the workload is pulled back and automatically requeued. Administrators can tune settings via the HyperPod Console, including pod readiness wait times, node failure handling, sequential workload admission, and retry scheduling. The feature is available across multiple AWS regions.
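
The all-or-nothing behavior described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not HyperPod's implementation or any Kubernetes API: the `GangJob` class, the `try_admit` function, the slot-based capacity model, and the status strings are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GangJob:
    """Hypothetical model of a distributed job that needs all its pods at once."""
    name: str
    # Seconds after admission at which each pod reports ready; None = never ready.
    pod_ready_after_s: List[Optional[float]]

    @property
    def pods_required(self) -> int:
        return len(self.pod_ready_after_s)


def try_admit(job: GangJob, free_slots: int, ready_timeout_s: float) -> str:
    """Simulate one admission attempt and return the resulting status.

    "pending"  : not enough capacity for every pod, so no pod is placed.
    "running"  : every pod became ready within the timeout window.
    "requeued" : at least one pod missed the window, so the whole job is
                 pulled back to be retried later.
    """
    # All-or-nothing: never start a partial set of pods.
    if job.pods_required > free_slots:
        return "pending"
    all_ready = all(
        t is not None and t <= ready_timeout_s for t in job.pod_ready_after_s
    )
    return "running" if all_ready else "requeued"


if __name__ == "__main__":
    healthy = GangJob("healthy", [5.0, 8.0, 12.0, 20.0])
    straggler = GangJob("straggler", [5.0, 8.0, None, 20.0])

    print(try_admit(healthy, free_slots=4, ready_timeout_s=60.0))    # running
    print(try_admit(healthy, free_slots=3, ready_timeout_s=60.0))    # pending
    print(try_admit(straggler, free_slots=4, ready_timeout_s=60.0))  # requeued
```

The second call shows why gang scheduling avoids wasted compute: with only three free slots for a four-pod job, nothing is started, so no pods sit idle holding resources while they wait for a peer. The third call shows the requeue path, where a single pod that never becomes ready causes the whole job to be pulled back.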