Uber migrated 18,000 Hive ETL workflows, which together generate 5 million queries per month, to Spark SQL, achieving a 50% reduction in both runtime and resource usage. The migration involved building three core services: a Query Translation Service for converting HiveQL to Spark SQL, a Data Validation Service for ensuring output consistency, and an Automated Migration Service for orchestrating shadow testing. Key challenges included syntax differences, floating-point precision issues, non-deterministic functions, and performance gaps such as differing file creation patterns. The team developed custom solutions, including Hive-compatible behaviors in Spark, expression translations, and partition rebalancing, to maintain data consistency while taking advantage of Spark's performance benefits.
Table of contents
Motivation
Architecture
Migration Strategy
Automated Migration Service (AMS)
Query Translation Service
Data Validation Service
Bridging the Gap Between Hive and Spark