Explore how Salesforce optimized infrastructure cost at scale by evolving Kubernetes scheduler behavior to eliminate node fragmentation.

The Salesforce Engineering Blog offers a deep dive into Salesforce technologies, providing technical insights, best practices, and real-world examples. Developers can explore topics such as Salesforce development, architecture, and integrations, gaining knowledge to build innovative solutions on the Salesforce platform. With contributions from Salesforce engineers and industry experts, the blog serves as a resource for the Salesforce community.

Salesforce Engineering

Salesforce's Data 360 team orchestrates nearly 2 million Spark applications daily on Kubernetes. The default kube-scheduler's LeastAllocated strategy caused node fragmentation by spreading Spark executors across many nodes, leaving idle capacity that Karpenter's reactive consolidation couldn't reclaim without disrupting running jobs. The team replaced the default scheduler with a custom MostAllocated bin-packing strategy using the NodeResourcesFit plugin, proactively stacking executors onto already-utilized nodes before provisioning new ones. This eliminated fragmentation at scheduling time, raised CPU and memory utilization by ~15%, cut compute infrastructure costs by 13%, and reduced EC2 node disruption rates by 50% by allowing autoscaling to terminate empty nodes rather than evict active pods.
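To make the difference between the two scoring strategies concrete, here is a minimal sketch of a toy scheduler in Python. It is not the kube-scheduler implementation or Salesforce's configuration: the `Node` class, the `place` helper, the node sizes, and the executor requests are all hypothetical, and the scoring is a simplified model in which each node is scored by its average CPU and memory utilization after the pod is placed (high utilization wins under MostAllocated, low utilization wins under LeastAllocated).

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node model: allocatable capacity and current requests."""
    name: str
    cpu_capacity: float   # allocatable CPU cores
    mem_capacity: float   # allocatable memory in GiB
    cpu_used: float = 0.0
    mem_used: float = 0.0

    def fits(self, cpu: float, mem: float) -> bool:
        return (self.cpu_used + cpu <= self.cpu_capacity
                and self.mem_used + mem <= self.mem_capacity)


def score(node: Node, cpu: float, mem: float, strategy: str) -> float:
    """Simplified 0-100 score: average CPU/memory utilization after placement."""
    cpu_frac = (node.cpu_used + cpu) / node.cpu_capacity
    mem_frac = (node.mem_used + mem) / node.mem_capacity
    utilization = (cpu_frac + mem_frac) / 2
    if strategy == "MostAllocated":    # bin-packing: prefer fuller nodes
        return 100 * utilization
    if strategy == "LeastAllocated":   # spreading: prefer emptier nodes
        return 100 * (1 - utilization)
    raise ValueError(f"unknown strategy: {strategy}")


def place(nodes: list[Node], pods: list[tuple[float, float]], strategy: str) -> list[str]:
    """Place each (cpu, mem) pod on the highest-scoring node that fits it.

    Returns the names of nodes that ended up hosting at least one pod.
    """
    for cpu, mem in pods:
        candidates = [n for n in nodes if n.fits(cpu, mem)]
        if not candidates:
            raise RuntimeError("no node fits; a real cluster would provision one")
        # max() keeps the first node on ties, so the outcome is deterministic.
        best = max(candidates, key=lambda n: score(n, cpu, mem, strategy))
        best.cpu_used += cpu
        best.mem_used += mem
    return [n.name for n in nodes if n.cpu_used > 0]


def make_nodes() -> list[Node]:
    # Four identical, initially empty nodes (sizes chosen for illustration).
    return [Node(f"node-{i}", cpu_capacity=16, mem_capacity=64) for i in range(4)]


# Four Spark-executor-like pods, each requesting 4 cores and 16 GiB.
executors = [(4.0, 16.0)] * 4

spread = place(make_nodes(), executors, "LeastAllocated")
packed = place(make_nodes(), executors, "MostAllocated")
print("LeastAllocated uses nodes:", spread)
print("MostAllocated uses nodes:", packed)
```

In this toy setup the spreading strategy leaves one executor on each of the four nodes, while the packing strategy fills a single node and leaves three empty. Empty nodes are the ones an autoscaler can remove without evicting running pods, which is the effect the summary above attributes to the switch.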

How Data 360 Optimized Kubernetes Scheduling Architecture