Uber has made significant progress in scaling its AI/ML infrastructure, both by migrating workloads from on-prem to the cloud and by optimizing its existing fleet. The team implemented a unified federation layer for batch workloads, upgraded network bandwidth to improve training efficiency, and upgraded host memory to raise GPU allocation rates. On the new-infrastructure side, they are evaluating the price-performance ratios of cloud GPU SKUs and improving LLM training efficiency through memory offload.
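The post doesn't specify which offload mechanism Uber uses; as one common approach (an assumption, not Uber's documented setup), frameworks like DeepSpeed can offload optimizer state and parameters from GPU to host memory via a ZeRO configuration, trading PCIe transfer time for larger trainable model sizes:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true }
  }
}
```

With a config like this, optimizer state and sharded parameters live in pinned host RAM and are streamed to the GPU only when needed, which is what makes memory offload a lever for training larger LLMs on a fixed GPU fleet.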

From uber.com · 8 min read
Table of contents

- Goal and Key Metrics
- Optimizing Existing On-prem Infrastructure
- Building New Infrastructure
- Acknowledgments
