Lyft migrated their ML platform from a fully Kubernetes-based architecture to a hybrid approach, using AWS SageMaker for offline training and batch workloads while keeping Kubernetes for online model serving. The transition reduced operational complexity by eliminating custom orchestration logic, background watchers, and cluster management overhead. Key technical challenges included replicating the Kubernetes runtime environment, building cross-platform Docker images, optimizing startup times with SOCI indexes and warm pools, and solving cross-cluster networking for Spark. The migration was designed to be invisible to users, requiring zero changes to ML code while significantly improving system reliability and reducing compute costs.
Sort: