The Ultimate Guide to GPU Scaling With Karpenter

A practical guide to GPU scaling on Amazon EKS using Karpenter, covering five common mistakes and eight best practices. The mistakes are using Auto Scaling groups (ASGs) instead of Karpenter's EC2 Fleet API, over-constraining instance families, ignoring GPU count reporting bugs on P5/G6 instances, ignoring cold-start image pull costs, and mishandling voluntary vs. involuntary disruptions. The best practices cover Spot-to-Spot consolidation, pre-fetching images via EBS snapshots with Bottlerocket, GPU time-slicing (with MIG where isolation is needed), the do-not-disrupt annotation for long training jobs, gang scheduling with Kueue or NVIDIA KAI Scheduler, pinning AMIs to prevent driver drift, configuring EFA security groups to fix NCCL timeouts, and enabling prefix assignment mode in VPC CNI for higher pod density.

6m read time. From cloudnativenow.com
Table of contents

5 GPU Scaling Mistakes to Stop Making
    Stop Using ASGs for GPUs
    Stop Over-Constraining Instance Families
    Stop Trusting Default Resource Reporting for P5 & G6 Instances
    Stop Paying Cold-Start Costs From Image Pulls
    Stop Ignoring Voluntary and Involuntary Disruption Types
8 Best Practices for GPU Scaling
    Use Spot-to-Spot Consolidation (With Caution)
    Pre-Fetch Images With EBS Snapshots
    Use Time-Slicing, But Know Its Limits
    Use the do-not-disrupt Annotation
    Use Gang Scheduling for Distributed Training
    Pin AMIs To Prevent Drift
    Configure EFA to Fix NCCL Timeouts
    Increase Pod Density Prefix Assignment Mode
Why Karpenter’s Bin-Packing Model Is Critical for GPU Efficiency