From monolith to global mesh: How Uber standardized ML at scale

Uber's Michelangelo ML platform evolved from a monolithic architecture into a cloud-native, Kubernetes-based system to handle 30 million predictions per second. Key engineering solutions include 100+ custom resource definitions (CRDs) with a MySQL-backed storage abstraction to bypass etcd scaling limits, a federated batch scheduling layer that uses PropagationPolicy CRDs to eliminate stranded compute across regional clusters, a Python-native workflow engine called Uniflow for ML lifecycle orchestration, and a multi-cloud compute mesh spanning GCP, AWS, Azure, and OCI. The platform now supports 40 million trips per day across 70+ countries. Future plans focus on autonomous self-healing through AI agents for debugging, intelligent CI/CD governance, and zero-toil framework upgrades.
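As a rough illustration of what "eliminating stranded compute" can mean, the sketch below places a batch job on whichever regional cluster can fit it most tightly, so that small pockets of free capacity are used up instead of left idle. This is a hypothetical best-fit placement in plain Python, not Michelangelo's scheduler and not the PropagationPolicy CRD API; `Cluster`, `place_job`, and the GPU counts are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    """Hypothetical view of one regional cluster's spare capacity."""
    name: str
    free_gpus: int


def place_job(clusters: List[Cluster], gpus_needed: int) -> Optional[str]:
    """Pick a cluster for a job using best-fit placement.

    Among the clusters that can fit the job, choose the one that would be
    left with the fewest free GPUs (ties broken by name), reserve the GPUs
    there, and return its name. Return None if no cluster can fit the job.
    """
    candidates = [c for c in clusters if c.free_gpus >= gpus_needed]
    if not candidates:
        return None
    best = min(candidates, key=lambda c: (c.free_gpus - gpus_needed, c.name))
    best.free_gpus -= gpus_needed  # reserve capacity on the chosen cluster
    return best.name
```

A real federated layer would also weigh data locality, quotas, and priorities; the point here is only that a scheduler with a view across all regions can route a job to capacity that a single-cluster scheduler would never see.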

#machine-learning #kubernetes #platform-engineering #mlops #multi-cloud

Mar 17 • 6m read time • From thenewstack.io

Table of contents

Introduction: the scaling wall
The architecture: solving the “platform reality”
Impact: production reality at Uber
Future directions: the next operational frontier
