By 2024, Slack's data platform had accumulated 700+ SSH-based Airflow operators running jobs on AWS EMR clusters, which created security risks and operational fragility and blocked infrastructure modernization. The team migrated all jobs to a REST-based architecture built on Quarry, an internal job submission gateway, and on YARN's Distributed Shell feature, which runs arbitrary shell commands without SSH. The migration covered 8 data regions, 7 operator types, and 5 teams over 3 quarters with zero downtime. Key challenges included YARN virtual memory check failures (previously hidden by SSH), network topology mismatches across account boundaries, and the complexity of multi-region coordination. Results included elimination of the SSH attack surface, improved job reliability (jobs survive Kubernetes pod restarts), better observability, and an unblocked Spark-on-Kubernetes migration.
Table of contents
How We Got Here
The Real Cost of SSH
Understanding the Foundation: REST-Based Job Submission
The Breakthrough: YARN Distributed Shell
The Migration Journey
The Challenges We Hit
The Results
What We Learned
Acknowledgments