This blog shares how we integrated YuniKorn into our Spark-on-EKS architecture, the challenges we faced, and how it helps us better handle fluctuating workloads.

Halodoc is a healthcare technology platform that offers telemedicine services, an online pharmacy, and health information resources to users in Indonesia. Through its platform, Halodoc aims to improve access to healthcare services, empower patients to take control of their health, and facilitate collaboration between healthcare providers. Developers can learn from Halodoc's use of telemedicine, artificial intelligence, and data analytics to address healthcare challenges and improve patient care and outcomes.

Halodoc

Halodoc's data engineering team shares how they integrated Apache YuniKorn into their EMR on EKS Spark platform to address three core problems: invisible deadlocks from uncoordinated Driver/Executor scheduling, multi-tenant resource contention, and cost spikes from burst-triggered node provisioning. The solution had three parts: enabling gang scheduling (hard style) so that each Spark job is scheduled as an atomic unit, configuring hierarchical queues with max resource limits to enforce fairness across workloads, and enabling bin-packing node sorting to consolidate workloads onto existing nodes before triggering scale-outs. A fallback mechanism redirects jobs that exceed queue limits to the default Kubernetes scheduler so they do not stay pending indefinitely. Reported results include node memory utilization above 90%, CPU utilization around 88%, and an approximately 10% reduction in EC2 costs, with increased Spot instance usage as a secondary benefit.
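To make the three ideas in this summary concrete, here is a minimal Python sketch of the decision logic only. It is not Halodoc's code and not YuniKorn's implementation: the `Resources` type, the function names, and the rule of comparing a job's total request against a single queue maximum are all assumptions made for illustration. The only real identifier used is `default-scheduler`, the name of the default Kubernetes scheduler.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource vector (not a YuniKorn type)."""
    memory_gib: float
    vcores: float

    def fits_within(self, limit: "Resources") -> bool:
        return self.memory_gib <= limit.memory_gib and self.vcores <= limit.vcores


def gang_request(driver: Resources, executor: Resources, executor_count: int) -> Resources:
    """Total request when driver and all executors are treated as one atomic unit."""
    return Resources(
        memory_gib=driver.memory_gib + executor.memory_gib * executor_count,
        vcores=driver.vcores + executor.vcores * executor_count,
    )


def choose_scheduler(request: Resources, queue_max: Resources) -> str:
    """Assumed fallback rule: a gang that can never fit under the queue's max
    limit would stay pending forever, so route it to the default scheduler."""
    return "yunikorn" if request.fits_within(queue_max) else "default-scheduler"


def pick_node_binpacking(request: Resources, free_by_node: Dict[str, Resources]) -> Optional[str]:
    """Simplified bin-packing: among nodes that can hold the request, prefer the
    one with the least free capacity, so emptier nodes stay free to scale in."""
    candidates = {name: free for name, free in free_by_node.items() if request.fits_within(free)}
    if not candidates:
        return None  # nothing fits; a real cluster would trigger a scale-out here
    return min(candidates, key=lambda name: (candidates[name].memory_gib, candidates[name].vcores))
```

For example, a job with a 4 GiB / 1 vcore driver and ten 8 GiB / 2 vcore executors requests 84 GiB and 21 vcores as a gang. Under these assumptions it stays on YuniKorn when the queue maximum is 100 GiB / 32 vcores and falls back to `default-scheduler` when the maximum is 64 GiB / 32 vcores.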

Implementing Apache YuniKorn on EMR on EKS at Halodoc