InfoQ

Uber redesigned its Hive data warehouse by federating over 16,000 datasets totaling 10+ petabytes, moving from a monolithic instance to a decentralized, domain-specific architecture. The migration uses a pointer-based approach in the Hive Metastore to redirect datasets to new HDFS locations without duplicating data, ensuring zero downtime for analytics and ML pipelines. Four key components handle the process: Bootstrap Migrator, Realtime Synchronizer, Batch Synchronizer, and Recovery Orchestrator. The result is improved ACL enforcement, reduced noisy-neighbor effects, better governance, and reclamation of over 1 PB of HDFS space through removal of stale datasets.
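
To make the pointer-based idea concrete, here is a minimal Python sketch. It is not Uber's implementation: the `Metastore` class, the table name, and the HDFS paths are hypothetical stand-ins, and the only real-world element is the Hive DDL statement `ALTER TABLE ... SET LOCATION`, which changes a table's storage location in the metastore without moving files. The sketch shows why repointing avoids duplication: readers resolve a dataset through the catalog, so changing the catalog entry redirects them while the data itself is never copied.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Metastore:
    """Toy stand-in (hypothetical) for a Hive Metastore: 'db.table' -> location."""

    locations: dict[str, str] = field(default_factory=dict)

    def resolve(self, table: str) -> str:
        """Return the storage location that readers of `table` would be sent to."""
        return self.locations[table]

    def repoint(self, table: str, new_location: str) -> str:
        """Swap the location pointer for `table`.

        Only metadata changes here; no data files are read or written.
        Returns the previous location so a caller could roll back.
        """
        previous = self.locations[table]
        self.locations[table] = new_location
        return previous


def set_location_ddl(table: str, new_location: str) -> str:
    """Build the Hive DDL that changes a table's location in the metastore."""
    return f"ALTER TABLE {table} SET LOCATION '{new_location}'"


# Hypothetical dataset and paths, for illustration only.
metastore = Metastore({"rides.trips": "hdfs://monolith/warehouse/trips"})
old_location = metastore.repoint("rides.trips", "hdfs://rides-domain/warehouse/trips")
ddl = set_location_ddl("rides.trips", metastore.resolve("rides.trips"))
```

Keeping the previous location as the return value of `repoint` is a design choice of this sketch: it makes a rollback a second pointer swap rather than a data restore, which is the kind of property a zero-downtime migration would need.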

Uber’s Hive Federation Decentralizes 16K Datasets and 10+ PB for Zero-Downtime Analytics at Scale