Halodoc's data engineering team describes how it integrated Apache YuniKorn into its EMR on EKS Spark platform to address three core problems: invisible deadlocks caused by uncoordinated Driver/Executor scheduling, multi-tenant resource contention, and cost spikes from burst-triggered node provisioning. The solution enabled gang scheduling (hard style) so that each Spark job is scheduled as an atomic unit, configured hierarchical queues with max resource limits to enforce fairness across workloads, and turned on bin-packing node sorting to consolidate workloads onto existing nodes before a scale-out is triggered. A fallback mechanism redirects jobs that exceed queue limits to the default Kubernetes scheduler so they do not stay pending indefinitely. Reported results include node memory utilization above 90%, CPU utilization around 88%, and an approximately 10% reduction in EC2 costs, with increased Spot instance usage as a secondary benefit.
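
The three mechanisms in the summary can be illustrated with a toy model. The sketch below is not Halodoc's implementation and does not call any YuniKorn API. The class names, the helper functions, and the rule that a job falls back when its total demand exceeds the queue's max limit are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    vcores: int
    memory_gib: int

    def __add__(self, other):
        return Resources(self.vcores + other.vcores,
                         self.memory_gib + other.memory_gib)

    def fits_within(self, limit):
        return (self.vcores <= limit.vcores
                and self.memory_gib <= limit.memory_gib)


@dataclass(frozen=True)
class SparkJob:
    driver: Resources
    executor: Resources
    executor_count: int

    def gang_demand(self):
        # Gang scheduling treats the driver and all executors as one unit,
        # so the demand is the sum of every pod in the job.
        total = self.driver
        for _ in range(self.executor_count):
            total = total + self.executor
        return total


def pick_scheduler(job, queue_max):
    # Assumed fallback rule: a job whose whole gang can never fit under the
    # queue's max limit is handed to the default Kubernetes scheduler
    # instead of being left pending.
    if job.gang_demand().fits_within(queue_max):
        return "yunikorn"
    return "default-scheduler"


def pick_node_binpacking(free_by_node, pod):
    # Bin-packing: among nodes where the pod fits, prefer the one with the
    # least free capacity, so emptier nodes stay free or can be removed.
    candidates = [
        (free.memory_gib, free.vcores, name)
        for name, free in free_by_node.items()
        if pod.fits_within(free)
    ]
    return min(candidates)[2] if candidates else None


job = SparkJob(driver=Resources(1, 4), executor=Resources(2, 8), executor_count=4)
print(job.gang_demand())
print(pick_scheduler(job, queue_max=Resources(16, 64)))
print(pick_scheduler(job, queue_max=Resources(8, 64)))
```

In this model the job needs 9 vcores and 36 GiB in total, so it is admitted under a 16 vcore / 64 GiB queue limit and redirected under an 8 vcore limit. A real deployment would express the same ideas through scheduler configuration and pod metadata rather than application code.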

17 min read time. From blogs.halodoc.io