The Ultimate Guide to GPU Scaling With Karpenter


A practical guide to GPU scaling on Amazon EKS using Karpenter, covering five common mistakes and eight best practices. The mistakes are using Auto Scaling groups (ASGs) instead of Karpenter's EC2 Fleet API, over-constraining instance families, ignoring GPU count reporting bugs on P5/G6 instances, ignoring cold-start image pull costs, and mishandling voluntary vs. involuntary disruptions. The best practices cover Spot-to-Spot consolidation, pre-fetching images via EBS snapshots with Bottlerocket, GPU time-slicing (with MIG where isolation is needed), the do-not-disrupt annotation for long training jobs, gang scheduling with Kueue or NVIDIA KAI Scheduler, pinning AMIs to prevent driver drift, configuring EFA security groups to fix NCCL timeouts, and enabling prefix assignment mode in VPC CNI for higher pod density.

#aws

#kubernetes

#gpu

Mar 17 • 6m read time • From cloudnativenow.com

Table of contents

- 5 GPU Scaling Mistakes to Stop Making
  - Stop Using ASGs for GPUs
  - Stop Over-Constraining Instance Families
  - Stop Trusting Default Resource Reporting for P5 & G6 Instances
  - Stop Paying Cold-Start Costs From Image Pulls
  - Stop Ignoring Voluntary and Involuntary Disruption Types
- 8 Best Practices for GPU Scaling
  - Use Spot-to-Spot Consolidation (With Caution)
  - Pre-Fetch Images With EBS Snapshots
  - Use Time-Slicing, But Know Its Limits
  - Use the do-not-disrupt Annotation
  - Use Gang Scheduling for Distributed Training
  - Pin AMIs To Prevent Drift
  - Configure EFA to Fix NCCL Timeouts
  - Increase Pod Density Prefix Assignment Mode
- Why Karpenter’s Bin-Packing Model Is Critical for GPU Efficiency
