Apache Hive™ on Apache Spark™ has been the preferred engine for ETL workloads at Uber. Hive on Spark supports a wide range of use cases across verticals such as compliance, financial reporting, planning, forecasting, fraud, and risk analysis. Before the migration, there were about 18,000 Hive ETL workflows generating around 5 million queries per month, contributing to a significant percentage of Uber’s total YARN usage. Hive was also used for interactive use cases, handling around 150,000 interactive queries per month. This blog describes our migration journey from Hive to Apache Spark SQL™ and the challenges we faced along the way.

Uber Engineering

Uber successfully migrated 18,000 Hive ETL workflows generating 5 million monthly queries to Spark SQL, achieving a 50% reduction in runtime and resource usage. The migration involved building three core services: a Query Translation Service for converting HiveQL to Spark SQL, a Data Validation Service for ensuring output consistency, and an Automated Migration Service for orchestrating shadow testing. Key challenges included syntax differences, floating-point precision issues, non-deterministic functions, and performance gaps such as file creation patterns. The team developed custom solutions, including Hive-compatible behaviors in Spark, expression translations, and partition rebalancing, to maintain data consistency while leveraging Spark's performance benefits.
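
To make the output-consistency check concrete, here is a minimal sketch of how two result sets could be compared while tolerating small floating-point differences between engines. It is not Uber's Data Validation Service, whose internals are not described here. The function names `normalize_row` and `outputs_match`, the `float_digits` parameter, and the choice to round floats before comparing are all assumptions made for illustration.

```python
from collections import Counter


def normalize_row(row, float_digits=6):
    """Round float columns so tiny precision differences do not count as mismatches.

    Hypothetical helper: the rounding precision is an illustrative assumption.
    """
    return tuple(
        round(value, float_digits) if isinstance(value, float) else value
        for value in row
    )


def outputs_match(hive_rows, spark_rows, float_digits=6):
    """Compare two result sets as multisets of normalized rows.

    Row order is ignored, because the two engines may emit rows in a
    different order, but duplicate rows are still counted.
    """
    hive_counts = Counter(normalize_row(r, float_digits) for r in hive_rows)
    spark_counts = Counter(normalize_row(r, float_digits) for r in spark_rows)
    return hive_counts == spark_counts


# Same data, different row order, and a float that differs only in its last bits.
hive_rows = [(1, "a", 0.1 + 0.2), (2, "b", 1.5)]
spark_rows = [(2, "b", 1.5), (1, "a", 0.3)]
print(outputs_match(hive_rows, spark_rows))
```

Rounding keeps the comparison simple, but two values that straddle a rounding boundary can still be reported as different, and NaN values never compare equal. A production check would need a policy for both cases, and for non-deterministic functions whose outputs legitimately differ between runs.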

How Uber Migrated from Hive to Spark SQL for ETL Workloads