Uber redesigned its Hive data warehouse by federating over 16,000 datasets totaling 10+ petabytes, moving from a monolithic instance to a decentralized, domain-specific architecture. The migration uses a pointer-based approach in the Hive Metastore to redirect datasets to new HDFS locations without duplicating data, ensuring zero downtime for analytics and ML pipelines. Four key components handle the process: Bootstrap Migrator, Realtime Synchronizer, Batch Synchronizer, and Recovery Orchestrator. The result is improved ACL enforcement, reduced noisy-neighbor effects, better governance, and reclamation of over 1 PB of HDFS space through removal of stale datasets.
Sort: